Upload
doannguyet
View
238
Download
1
Embed Size (px)
Citation preview
Hot Fuzz: using SAS “fuzzy” data matching
techniques to identify duplicate and
related customer records.
Stuart Edwards, Collection House Group
• Debt Collection; purchased debt, receivables and outsourced collections.
• Banking and Finance, Insurance, Government, Telecom and Utility debt.
• More than 250,000 purchased accounts currently active; aggregate face
value exceeding $5 Billion.
• Call-centre based; CSOs locate customers and negotiate an outcome.
Page 2
Collection House Group; what do we do?
Page 3
Skip tracing; the art of finding people.
• Debt usually sold / outsourced as the customer has defaulted and “skipped”.
• New “Leads” must be identified and worked to contact the customer.
• Leads data can be gathered from multiple providers.
• Our database of existing or previous customers is a free source of leads
• Cross-referencing is employed to search this data for linkages.
? +
Page 4
Why do we perform cross-referencing?
• Many customers have multiple debts.
• Makes skip-tracing more efficient and more timely.
• Fuzzy-matching uncovers non-standard matches – e.g. name changes (due
to marriage), mis-spellings and co-habitants.
• Linking accounts enables consistent negotiation.
• Cross-referencing results may indicate a customer’s propensity to pay.
Customer 1Customer 2
Loan
Account 2
Motor Loan
Account 7Telco
Account 4
Credit Card
Account 1
Utility
Account 3PrimaryCo-borrower
PrimaryGuarantor07 1234 5678
<Unknown>
Page 5
What attributes help us to determine a match?
Name: J Smith
DOB: 12/05/1968
10 Queen St, Brisbane
Name: John Smith
DOB: 05/12/1968
4 Merthyr Rd, New Farm
?• Need to identify both perfect and potential matches.
• Input data can be of varying quality!
• There are a number of attributes upon which we can match.
First NameMiddle Name(s)Surname
Maiden NameMonikersInitials
AddressDPIDX/Y Coordinates
Drivers LicenceCredit Bureau Reference
Telephone number(s)Email address
Date of BirthInverted DD / MM
SAS offers a range of methods for matching data.
Page 6
Method Benefits Disadvantages
data c1;
merge a1 b1;
by <Common Variable>;
Data step match-merge. Simple, Base SAS syntax.
Fast execution.
Not suited to many-to-
many matches.
proc sql;
create table c1 as
select a1.*, b1.*
from a1 left join b1 on
(<Common Variable>);
Proc SQL join. Generates a Cartesian
product; good for many-
to-many joins.
SAS Data Flux. Optimised for this task.
Matching logic in-built.
More expensive than our
budget for the features
we required.
data c1;
declare Hash HX (dataset:
"Match_Data"
.......etc
Hash table matching. In memory execution.
Enables matching on
multiple key variables.
Slow when merging large
data sets.
Can fill up disk space if
not optimised.
Complex syntax to write
and debug.
Can fill up memory with
large data sets.
Page 7
Hash matching was chosen as the preferred
strategy.*Match Base_Data to Match_Data via Hash Table;
data Match_Output;
*Declare the Hash key;
declare Hash HX (dataset: "Match_Data", multidata: 'y',
hashexp: 12);
rc = HX.DefineKey ("KEY1");
rc = HX.DefineData (<Match Variables to Write>);
rc = HX.DefineDone ();
*Populate the hash table;
do until (eof_hash);
set Match_Data end = eof_hash;
rc = HX.add ();
end;
*Loop through the records in the base data set;
do until (eofX) ;
set Base_Data end = eofX;
rc = HX.find ();
if (rc = 0) then do;
if <Crude Filter Conditions> then output;
HX.has_next(result: r);
do while(r ne 0);
rc = HX.find_next();
if <Crude Filter Conditions> then output;
HX.has_next(result: r);
end;
end;
end;
stop;
run;
• Many-to-many matching is required.
• Data volumes are large – 1M+ records
matched against 1M+ records.
• In memory processing – reliable with
acceptable speed.
• Matching is performed on common
variables.
• Crude filters applied to eliminate obvious
non-matches.
Page 8
Multiple keys are generated from transformed attributes to
perform the hash matches.
• The Soundex function is used to give similar
sounding names a “fuzzy” value.
• Multiple key values are combined.
• The MD5 function generates a Hash Key.
• Hash key is read into memory and records are
matched on the key.
• Multiple hash results combined by SQL Union.
-
SX_Name = soundex(<Name>);
Key_Name = put(md5(Key_Cat),$hex8.);
rc = HX.DefineKey (“Key_Name");
proc sql;
create table Hash_Combined as
select * from Hash_Result1
UNION
select * from Hash_Result2
;
quit;
JOHN-PAUL
JONNY PAOLOJ514
Key_Cat = catt(SX_Name,SX_Surname);
J514T2435253 BC590E6B
Page 9
Filtering is undertaken to remove unlikely match pairs.
• Crude filtering applied to exclude obvious non-matches.
• PAF data is used to cleanse and validate addresses.
o DPID (premises-specific delivery ID) used in Key generation.
o X/Y Coordinates, to calculate geographical distance.
ID ID_X Key Key Type Name Name_X Address Address_X Distance Score Match Type
46873 46132 1FGH8 Surname +
Initial
Robert
Jones
R Jones 1 King St,
Sydney
1 King St,
Sydney
46873 45944 2J9KX Name + DOB Robert
Jones
Robbie
Jones
1 King St,
Sydney
5 New Rd,
Pyrmont
46873 40185 J6R7F Name + DPID Robert
Jones
Mary
Jones
1 King St,
Sydney
1 King St,
Sydney
46873 42922 L13A9 Name Robert
Jones
Bob
Jones
1 King St,
Sydney
18 Old St,
Bondi
0 km
2.1 km
0 km
6.2 km
OHN D
Page 10
Enterprise Miner is used to derive a propensity-to-
match score using available points of ID.
• Customer records with many points of ID are
used to seed a model.
• Credit Bureau data verifies the match and
sets the Target variable (match or no-
match).
• Points of ID are selectively removed.
• Hash matching conducted on sampled
data.
• Propensity-to-match model is built on the
validated target variable.
J
28/05/1975
0412 345 678
5 Wombat StKangaroo PtQLD 4069
QLD1678904
QLD 4075
7 Echidna RdSherwood
SMITH
Page 11
Enterprise Miner is used to derive a propensity-to-
match score using available points of ID.
• Results of model nodes determine the
worth of input variables.
• Models are compared and the best is used
to generate score code.
• Score logic applied to records as given to
deliver a match likelihood score.
ID ID_X Key Key Type Name Name_X Address Address_X Distance Score Match Type
46873 46132 1FGH8 Surname +
Initial
Robert
Jones
R Jones 1 King St,
Sydney
1 King St,
Sydney
0 km
46873 45944 2J9KX Name + DOB Robert
Jones
Robbie
Jones
1 King St,
Sydney
5 New Rd,
Pyrmont
2.1 km
46873 40185 J6R7F Name + DPID Robert
Jones
Mary
Jones
1 King St,
Sydney
1 King St,
Sydney
0 km
46873 42922 L13A9 Name Robert
JonesBob
Jones1 King St,
Sydney18 Old St,
Bondi6.2 km
920
615
276
102
EXACT
PROBABLE
FAMILY
POSSIBLE
Page 12
Filtered results are displayed to our Collections Agents
through the CRM.
• Pertinent information is displayed to CSOs for assessment.
Type: Account: 46021358721
Name: ROBERT JONES DOB: 12/04/1977
Address: 1 KING ST, SYDNEY, NSW 2000 D/L: N/A
Home Ph: 02 2134 5768 Work Ph: N/A Mobile Ph: N/A
Client: BIG FINANCE (BF1234) Product: CREDIT CARD
Balance: $5,241 Last Paid: - Last Contact: 15/03/2015
Current Intention: Located, no active commitment.
Type: Account: 46021358721
Name: ROBBIE JONES DOB: 12/04/1977
Address: 5 NEW RD, PYRMONT, NSW 2006 D/L: NSW268427
Home Ph: N/A Work Ph: N/A Mobile Ph: 0411 223 344
Client: POWER2 (PT5678) Product: ELECTRICITY SUPPLY
Balance: $0 Last Paid: 12/01/2015 Last Contact: 12/01/2015
Current Intention: Account Paid in Full.
Page 13
Project successes and lessons learned.
• Time for more than 20 FTE saved by eliminating manual XREF.
• New cross-reference approach identified new match pairs.
• Filtering to balance between too-many and too-few “possible” matches.
• Performing many-to-many matches on large datasets can be resource
hungry and result in crashes.
• Hash matching is the most reliable solution.
• Enterprise Miner derived score models allowed more qualitative assessment.
Page 14
References to useful Hash Table resources.
• SAS Hash Object Programming Made Easy.
Michele M. Burlew (SAS Press, 2012)).
• Think FAST! Use Memory Tables (Hashing) for Faster Merging.
SUGI 31,Paper 244-31 (Gregg Snell, Data Savant Consulting, KS)
• An Introduction to Hash Tables.
SAS Canada User Groups, 2013 (Shaun Kaufmann, Farm Credit Canada).
• Table Look-Up by Direct Addressing: Key-Indexing – Bitmapping – Hashing.
SUGI 26,Paper 8-26 (Paul M. Dorfman, CitiCorp AT&T Universal Card, Jacksonville, FL)
• Choosing the Right Technique to Merge Large Data Sets Efficiently.
SUGI 26,Paper 071-2009 (Qingfeng Liang, Community Care Behavioral Health
Organization, Pittsburgh, PA)
Page 15
Questions?
Stuart Edwards
Head of Analytics, Collection House Group