Data Quality Analytics:
Scott Murdoch, PhD
Understanding what is in your data, before using it.
AGENDA
1. What is Data Quality Analytics?
2. Understanding your data
3. Implementing Data Quality Analytics – Integration with IT

What is Data Quality Analytics?
Data Quality Analytics is the use of distributions and modeling techniques to understand the pitfalls within data.
• Cost: the opportunity cost of time
• Savings: % of revenue, reputation, embarrassment

Why is it needed – Cost of Dirty Data
Dirty data can cost any company productivity, brand perception, and most importantly revenue. 'Spot checking' data is no longer effective.
IMPORTANT: The steps that follow are for preprocessing, not post-validation.
Cost of Dirty Data
Making decisions off 'dirty' data has an estimated cost of $3 trillion per year for the US.1
The healthcare industry alone has an estimated cost of $314 billion due to 'dirty' data.1
1http://www.hoovers.com/lc/sales-marketing-education/cost-of-dirty-data.html
Understanding your Data
So you are ready? So you think…
1. What is the problem you are trying to solve?
2. What type of data do you have?
3. What do you really know about the data?
CAUTION: This is not as easy or straightforward as it seems.
PAUSE: In your last data project, what predispositions did you have about the data? Were you right?
Identify Key Fields within your Data
• Unique Key of the Dataset: Member, Date of Service, Claim Number, Claim Line
• Crucial Fields Needed for Analysis: your dependent variable, and theoretical top independents
  • Allow Payment, Provider NPI, Covered Amount, CPT Code, Provider Specialty, Member Zip Code
• Other Fields: Medicare ID, Provider last name, Provider first name
Start with simple metrics for benchmarking
Compute the following metrics for EACH crucial field:
• % missing
• % zero
• Top 20 most frequent values
• Create histogram
• Minimum & maximum
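The benchmark metrics above can be sketched in pandas; this is a minimal example, and the claims DataFrame and `allow_payment` column name are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

def benchmark_metrics(df: pd.DataFrame, field: str) -> dict:
    """Compute the simple benchmark metrics for one crucial field."""
    col = df[field]
    metrics = {
        "pct_missing": col.isna().mean() * 100,           # % missing
        "top_20": col.value_counts().head(20).to_dict(),  # most frequent values
    }
    if pd.api.types.is_numeric_dtype(col):
        metrics["pct_zero"] = (col == 0).mean() * 100     # % zero, over all rows
        metrics["min"] = col.min()
        metrics["max"] = col.max()
        # Histogram counts per bin -- the data behind the plot
        counts, _edges = np.histogram(col.dropna(), bins=10)
        metrics["histogram"] = counts.tolist()
    return metrics

# Hypothetical 'Allow Payment' column: one missing value, two zeros
claims = pd.DataFrame({"allow_payment": [0.0, 120.5, None, 80.0, 0.0]})
m = benchmark_metrics(claims, "allow_payment")
```

Storing the results as a dict per field makes it easy to compare each new data batch against the benchmarked values later.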
Advanced Methods for Tracking Quality
Modeling Techniques: Regression, Clustering, or Neural Networks, etc.
01 REGRESSION: More setup, dependent variable needed, easier to explain
02 CLUSTERING: No dependent variable needed, harder to explain
03 NEURAL NETWORKS: Dependent variable needed, less setup, harder to explain
Setting up Advanced Data Quality Methods
Using crucial fields in an OLS regression
• Fields: Allow Payment, Provider NPI, Covered Amount, CPT Code, Provider Specialty, Member Zip code
• Dependent Variable: Allow Payment $
Build the best model based on your choice of goodness-of-fit statistics.
IMPORTANT: KEEP the coefficients for the future; this is the most important part!
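A minimal OLS sketch with numpy's least-squares solver, on synthetic stand-in data (real fields such as Provider NPI or CPT Code would first need numeric encoding; the coefficients and noise level here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for three encoded crucial fields; y plays the
# role of the dependent variable (Allow Payment $).
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Fit OLS with an intercept term
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Goodness of fit: R^2 on the training data
resid = y - X1 @ beta
r_squared = 1 - resid.var() / y.var()

# KEEP beta (e.g. write it to disk): future batches are scored against it
```

The fitted `beta` vector is exactly the artifact the slide says to keep for later validation.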
Setting up Advanced Data Quality Methods
K-means Clustering
• Fields: Allow Payment, Provider NPI, Covered Amount, CPT Code, Provider Specialty
Try building a 3-dimensional cluster using these fields.
How is the fit? Do the groups make sense?
IMPORTANT: KEEP the seeds for the future; this is the most important part!
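A minimal Lloyd's-algorithm sketch of K-means in numpy, on three synthetic 3-D blobs (in practice a library such as scikit-learn's KMeans would be typical; the blob centers here are invented for illustration):

```python
import numpy as np

def kmeans(X, centers, iters=50):
    """Minimal Lloyd's algorithm starting from explicit seed centers."""
    centers = centers.astype(float).copy()
    for _ in range(iters):
        # Assign each row to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned rows
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

# Three well-separated synthetic blobs standing in for 3 scaled fields
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 3))
               for c in ([0, 0, 0], [5, 5, 5], [0, 5, 0])])

# Seed with one point from each blob; KEEP the seeds and fitted centers
centers, labels = kmeans(X, X[[0, 50, 100]])
```

Keeping the seeds (and the fitted centers) is what lets you re-score future data against the same grouping and spot drift.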
Integration with IT
So you have checked your data; now what?
Create a marginal error range for each benchmark metric. Example:
• Metric: % missing
• Run random sampling without replacement using 60% of your sample, 1,000 times.
• The results from the samples will serve as the acceptable range.
As new data comes in, calculate these metrics and compare them to the acceptable range.
Requires partnership with Information Technology.
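The sampling procedure above can be sketched as follows; the historical missing rate (~10%) and the choice of the middle 95% as the acceptable band are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical history: one flag per row marking whether a field was missing
missing = rng.random(5000) < 0.10
n, k = len(missing), int(0.60 * len(missing))

# 60% subsamples without replacement, 1,000 times; record % missing each time
samples = np.array([
    missing[rng.choice(n, size=k, replace=False)].mean() * 100
    for _ in range(1000)
])

# The middle 95% of the sampled metric becomes the acceptable range
low, high = np.percentile(samples, [2.5, 97.5])
# A new batch whose % missing falls outside [low, high] is flagged for review
```

The same loop works for any of the benchmark metrics: replace the % missing calculation with % zero, the minimum, the maximum, and so on.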
Integration with IT
Model results are the second, more advanced, stage of integration.
Use your models as a method of validation in two ways:
1. Run the regression or neural network model using the coefficients from previous data, and compare the predicted fit.
2. Run a new model using the same variables, and calculate the change in coefficients.
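Both validation checks can be sketched with a numpy-based OLS; the data here is synthetic, and the idea is that a new batch drawn from the same process should pass both checks while a shifted batch would not:

```python
import numpy as np

def fit_ols(X, y):
    """OLS with intercept; returns the coefficient vector."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def r_squared(X, y, beta):
    pred = np.column_stack([np.ones(len(X)), X]) @ beta
    return 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(7)
true_beta = np.array([2.0, -1.0])

# Historical batch: fit once and KEEP the coefficients
X_old = rng.normal(size=(300, 2))
y_old = X_old @ true_beta + rng.normal(scale=0.1, size=300)
beta_old = fit_ols(X_old, y_old)

# New batch generated by the same process
X_new = rng.normal(size=(300, 2))
y_new = X_new @ true_beta + rng.normal(scale=0.1, size=300)

# Check 1: score the new batch with the OLD coefficients; fit should stay high
fit_on_new = r_squared(X_new, y_new, beta_old)

# Check 2: refit on the new batch and measure the coefficient shift
beta_new = fit_ols(X_new, y_new)
coef_shift = np.abs(beta_new - beta_old).max()
```

In production, `fit_on_new` and `coef_shift` would be compared against thresholds agreed with IT, in the same spirit as the acceptable ranges for the simple metrics.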
Questions