Data Quality Analytics:
Scott Murdoch, PhD
Understanding what is in your data, before using it.
AGENDA
1. What is Data Quality Analytics?
2. Understanding your data
3. Implementing Data Quality Analytics – Integration with IT

What is Data Quality Analytics?
Data Quality Analytics is the use of distributions and modeling techniques to understand the pitfalls within data.
• Cost: the opportunity cost of time
• Savings: % of revenue, reputation, embarrassment

Why is it needed – Cost of Dirty Data
Dirty data can cost any company productivity, brand perception, and most importantly revenue. 'Spot checking' data is no longer effective.
IMPORTANT: The steps that follow are for preprocessing, not post-validation.
Cost of Dirty Data
Making decisions off 'dirty' data has an estimated cost of $3 trillion per year for the US.1
The healthcare industry alone has an estimated cost of $314 billion due to 'dirty' data.1
1http://www.hoovers.com/lc/sales-marketing-education/cost-of-dirty-data.html
Understanding your Data
So you are ready? So you think…
1. What is the problem you are trying to solve?
2. What type of data do you have?
3. What do you really know about the data?
CAUTION: This is not as easy or straightforward as it seems.
PAUSE: In your last data project, what predispositions did you have about the data? Were you right?
Identify Key Fields within your Data
• Unique Key of the Dataset: Member, Date of Service, Claim Number, Claim Line
• Crucial Fields Needed for Analysis: your dependent variable, and theoretical top independents
  • Allow Payment, Provider NPI, Covered Amount, CPT Code, Provider Specialty, Member Zip Code
• Other Fields: Medicare ID, Provider last name, Provider first name
Start with simple metrics for benchmarking
Compute the following metrics for EACH crucial field:
• % missing
• % zero
• Top 20 most frequent values
• Create histogram
• Minimum & maximum
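The benchmark metrics above can be sketched in pandas; this is a minimal example, and the claims DataFrame and `allow_payment` column name are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

def benchmark_metrics(df: pd.DataFrame, field: str) -> dict:
    """Compute the simple benchmark metrics for one crucial field."""
    col = df[field]
    metrics = {
        "pct_missing": col.isna().mean() * 100,           # % missing
        "top_20": col.value_counts().head(20).to_dict(),  # most frequent values
    }
    if pd.api.types.is_numeric_dtype(col):
        metrics["pct_zero"] = (col == 0).mean() * 100     # % zero, over all rows
        metrics["min"] = col.min()
        metrics["max"] = col.max()
        # Histogram counts per bin -- the data behind the plot
        counts, _edges = np.histogram(col.dropna(), bins=10)
        metrics["histogram"] = counts.tolist()
    return metrics

# Hypothetical 'Allow Payment' column: one missing value, two zeros
claims = pd.DataFrame({"allow_payment": [0.0, 120.5, None, 80.0, 0.0]})
m = benchmark_metrics(claims, "allow_payment")
```

Storing the results as a dict per field makes it easy to compare each new data batch against the benchmarked values later.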
Advanced Methods for Tracking Quality
Modeling Techniques: Regression, Clustering, or Neural Networks, etc.
01 REGRESSION: More setup, dependent variable needed, easier to explain
02 CLUSTERING: No dependent variable needed, harder to explain
03 NEURAL NETWORKS: Dependent variable needed, less setup, harder to explain
Setting up Advanced Data Quality Methods
Using crucial fields in an OLS regression
• Fields: Allow Payment, Provider NPI, Covered Amount, CPT Code, Provider Specialty, Member Zip code
• Dependent Variable: Allow Payment $
Build the best model based on your choice of goodness-of-fit statistics.
IMPORTANT: KEEP the coefficients for the future; this is the most important part!
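A minimal OLS sketch with numpy's least-squares solver, on synthetic stand-in data (real fields such as Provider NPI or CPT Code would first need numeric encoding; the coefficients and noise level here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for three encoded crucial fields; y plays the
# role of the dependent variable (Allow Payment $).
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Fit OLS with an intercept term
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Goodness of fit: R^2 on the training data
resid = y - X1 @ beta
r_squared = 1 - resid.var() / y.var()

# KEEP beta (e.g. write it to disk): future batches are scored against it
```

The fitted `beta` vector is exactly the artifact the slide says to keep for later validation.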
Setting up Advanced Data Quality Methods
K-means Clustering
• Fields: Allow Payment, Provider NPI, Covered Amount, CPT Code, Provider Specialty
Try building a 3-dimensional cluster using these fields.
How is the fit? Do the groups make sense?
IMPORTANT: KEEP the seeds for the future; this is the most important part!
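A minimal Lloyd's-algorithm sketch of K-means in numpy, on three synthetic 3-D blobs (in practice a library such as scikit-learn's KMeans would be typical; the blob centers here are invented for illustration):

```python
import numpy as np

def kmeans(X, centers, iters=50):
    """Minimal Lloyd's algorithm starting from explicit seed centers."""
    centers = centers.astype(float).copy()
    for _ in range(iters):
        # Assign each row to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned rows
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

# Three well-separated synthetic blobs standing in for 3 scaled fields
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 3))
               for c in ([0, 0, 0], [5, 5, 5], [0, 5, 0])])

# Seed with one point from each blob; KEEP the seeds and fitted centers
centers, labels = kmeans(X, X[[0, 50, 100]])
```

Keeping the seeds (and the fitted centers) is what lets you re-score future data against the same grouping and spot drift.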
Integration with IT
So you have checked your data; now what?
Create a marginal error range for each benchmark metric. Example:
• Metric: % missing
• Run random sampling without replacement using 60% of your sample, 1,000 times.
• The results from the samples will serve as the acceptable range.
As new data comes in, calculate these metrics and compare them to the acceptable range.
Requires partnership with Information Technology.
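The sampling procedure above can be sketched as follows; the historical missing rate (~10%) and the choice of the middle 95% as the acceptable band are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical history: one flag per row marking whether a field was missing
missing = rng.random(5000) < 0.10
n, k = len(missing), int(0.60 * len(missing))

# 60% subsamples without replacement, 1,000 times; record % missing each time
samples = np.array([
    missing[rng.choice(n, size=k, replace=False)].mean() * 100
    for _ in range(1000)
])

# The middle 95% of the sampled metric becomes the acceptable range
low, high = np.percentile(samples, [2.5, 97.5])
# A new batch whose % missing falls outside [low, high] is flagged for review
```

The same loop works for any of the benchmark metrics: replace the % missing calculation with % zero, the minimum, the maximum, and so on.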
Integration with IT
Model results are the second, more advanced, stage of integration.
Use your models as a method of validation in two ways:
1. Run the regression or neural network model using the coefficients from previous data, and compare the predicted fit.
2. Run a new model using the same variables, and calculate the change in coefficients.
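Both validation checks can be sketched with a numpy-based OLS; the data here is synthetic, and the idea is that a new batch drawn from the same process should pass both checks while a shifted batch would not:

```python
import numpy as np

def fit_ols(X, y):
    """OLS with intercept; returns the coefficient vector."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def r_squared(X, y, beta):
    pred = np.column_stack([np.ones(len(X)), X]) @ beta
    return 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(7)
true_beta = np.array([2.0, -1.0])

# Historical batch: fit once and KEEP the coefficients
X_old = rng.normal(size=(300, 2))
y_old = X_old @ true_beta + rng.normal(scale=0.1, size=300)
beta_old = fit_ols(X_old, y_old)

# New batch generated by the same process
X_new = rng.normal(size=(300, 2))
y_new = X_new @ true_beta + rng.normal(scale=0.1, size=300)

# Check 1: score the new batch with the OLD coefficients; fit should stay high
fit_on_new = r_squared(X_new, y_new, beta_old)

# Check 2: refit on the new batch and measure the coefficient shift
beta_new = fit_ols(X_new, y_new)
coef_shift = np.abs(beta_new - beta_old).max()
```

In production, `fit_on_new` and `coef_shift` would be compared against thresholds agreed with IT, in the same spirit as the acceptable ranges for the simple metrics.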
Questions