IDQ Notes.docx

Basic Notes on Informatica Data Quality

Data Quality Architecture consists mainly of 4 layers –

1) Client Layer
2) Server Layer
3) Metadata Layer
4) Content Layer

We have 2 clients in the Client Layer, namely:

1) Informatica Analyst – a web-based client used to carry out Data Quality analysis.
2) Informatica Developer – a client tool used to carry out Data Quality activities.

All services under Data Quality serve requests from the client tools, Analyst and Developer. The services we use are:

Analyst Service – used to perform web-based data quality activities. It must always be up and running.

Model Repository Service – used to take care of repository activities: metadata, tables, structures, etc. It must always be up and running.

Data Integration Service – used to move data from one transformation to another. It must always be up and running.

Content Management Service – used to provide address content or identity content.

These services are created in the admin console.

The Metadata Layer is the Model Repository, where the metadata is stored. This is a central repository, meaning any change we make in the Analyst tool will be visible in the Developer tool and vice versa. Note that this repository is different from the PowerCenter/PowerExchange repository.

Reference Data – to perform standardization of variants like str, street, st., etc. (a simple find-and-replace method), we use reference data; a sketch of the idea follows after these definitions.

Address Data – holds the address details, like a master table used for reference.
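The reference-data idea above is essentially a lookup-and-replace. Here is a minimal sketch in Python, assuming a hypothetical REFERENCE_DATA dictionary standing in for an IDQ reference table (in the actual tool, reference tables are managed through Analyst/Developer, not code):

    # Minimal sketch of reference-data standardization (find and replace).
    # REFERENCE_DATA is a hypothetical stand-in for an IDQ reference table.
    REFERENCE_DATA = {
        "str": "street",
        "st.": "street",
        "st": "street",
    }

    def standardize(value: str) -> str:
        """Replace each token with its standard form when the reference data knows it."""
        tokens = value.lower().split()
        return " ".join(REFERENCE_DATA.get(tok, tok) for tok in tokens)

    print(standardize("111 Main Str"))  # -> "111 main street"
    print(standardize("111 Main St."))  # -> "111 main street"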

Example notes –

Master Record:

Merger

Let's say we have 2 banks, B1 and B2, and Bank B1 has acquired Bank B2.

Bank   Name   Address
B1     abc    #111, sname, city, zip
B2     abc1   null

After Merger:

We need to identify the duplicates, i.e. find matching records that represent the same customer. To do this matching, Data Quality offers 2 strategies: exact matching and probable matching.

With exact matching, if we consider Name as the match column, "abc" and "abc1" do not match, so no duplicates are found.

With probable matching, comparing abc with abc1 might score an 80% or 70% match, so these records are considered matches, i.e. duplicate records.
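As a rough illustration of probable matching, here is a minimal sketch using Python's standard-library difflib to compute a similarity score and compare it against a threshold; IDQ's Match transformation uses its own algorithms, so this shows only the idea, not the tool's implementation:

    # Minimal sketch of probable (fuzzy) matching: score string similarity
    # and treat scores above a threshold as duplicates. difflib is only a
    # stand-in for IDQ's match algorithms.
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a, b).ratio()

    score = similarity("abc", "abc1")
    print(f"similarity = {score:.2f}")  # 0.86

    THRESHOLD = 0.7  # threshold chosen for illustration
    if score >= THRESHOLD:
        print("probable match: treat the records as duplicates")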

The next step is the creation of good data as a result of these 2 matches; the result is called the Master Record.

Consolidation – these 2 records will be merged into a single record. We can define consolidation rules here.

Rule 1: prefer the longer value in a column. With Full Name as the input column, the name with more characters is treated as the good data.

abc1, #111, sname, city, zip -> the final result of the data quality tool (the Master Record).
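Rule 1 can be sketched as a per-column "longest non-null value wins" merge. The function and record layout below are illustrative only, reusing the B1/B2 example above:

    # Minimal sketch of consolidation: build one master record from matched
    # duplicates by keeping the longest non-null value per column (Rule 1).
    def consolidate(records: list[dict]) -> dict:
        master = {}
        for field in records[0]:
            candidates = [r[field] for r in records if r[field] is not None]
            master[field] = max(candidates, key=len, default=None)
        return master

    b1 = {"name": "abc", "address": "#111, sname, city, zip"}
    b2 = {"name": "abc1", "address": None}
    print(consolidate([b1, b2]))
    # -> {'name': 'abc1', 'address': '#111, sname, city, zip'}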

Other data quality tools are Trillium, SAS DataFlux, IBM InfoSphere QualityStage, AddressDoctor, and FirstLogic. AddressDoctor has been part of IDQ since Informatica acquired it.

Physical Data Objects – can be brought in by importing them from relational databases or flat files, or by creating them yourself as either relational or flat-file objects. They can be used for reading in or writing out data in a Mapping, and Mappings can contain transformation logic to modify the data.

(Figure: screenshot of Informatica Developer, with the database connections on the right-hand side.)

Join Analysis Profile

Let's assume we have 3 sources: Customer_Shipping, Orders, and Order_Details. I'm a customer and I buy dog food, so there would be an order, and the details of that order would be in Order_Details.

The goal of the example is to see whether we can create a Master Source File.

Without running any data analysis, can we answer questions like these?

Is there a unique (primary/foreign) key that makes sense for joining this data?

Let's look at Customers & Orders. Any orphans? Can we have orders without customers?

Let's look at Orders and Order_Details. Same questions: are orphans allowed? Could we have orders without order details? Or order details without orders? Or orders and order details without customers?
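Done by hand, these orphan questions reduce to set differences on the join keys. A minimal sketch with made-up key sets (a real check would read the actual tables):

    # Minimal sketch of a manual orphan check between Customers and Orders
    # using hypothetical key sets; IDQ's Join Analysis automates this.
    customer_ids = {1, 2, 3}        # primary keys in Customer_Shipping
    order_customer_ids = {1, 2, 4}  # foreign keys found in Orders

    orphan_orders = order_customer_ids - customer_ids  # orders without a customer
    customers_without_orders = customer_ids - order_customer_ids

    print("orders referencing missing customers:", orphan_orders)  # {4}
    print("customers with no orders:", customers_without_orders)   # {3}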

There is an easier way in IDQ to do this, called Join Analysis:-

Right-click and add a Joiner transformation. Let's say we are studying Order_Details and Orders. So on the left-hand side, right-click on Order_Details and click Profile.

Notice what comes up in the next wizard: it shows we can do Multiple Profiles, Profile, and Profile Model.

In this case we do Profile Model.

Generic Data Profiling:-

The way to create a Physical Data Object is to right-click on Physical Data Objects under the project folder and create a Physical Data Object (PDO).

PDO – where am I reading data from? Where is the data itself? Where am I going to write the results to? Commonly known as the source/target, except that we call it a PDO. PDOs can be relational, flat-file, non-relational, SAP, or Web Services data objects.
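To make generic profiling concrete, here is a minimal sketch that computes per-column null counts, distinct counts, and min/max over a flat file; the file name is hypothetical, and this is not how IDQ runs profiles internally:

    # Minimal sketch of generic column profiling over a flat file:
    # null count, distinct count, and min/max per column.
    import csv
    from collections import defaultdict

    def profile(path: str) -> dict:
        stats = defaultdict(lambda: {"nulls": 0, "values": set()})
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                for col, val in row.items():
                    if val in ("", None):
                        stats[col]["nulls"] += 1
                    else:
                        stats[col]["values"].add(val)
        return {
            col: {
                "nulls": s["nulls"],
                "distinct": len(s["values"]),
                "min": min(s["values"], default=None),
                "max": max(s["values"], default=None),
            }
            for col, s in stats.items()
        }

    print(profile("customers.csv"))  # hypothetical input file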