15
Hot Fuzz: using SAS “fuzzy” data matching techniques to identify duplicate and related customer records. Stuart Edwards, Collection House Group

Hot Fuzz: using SAS “fuzzy” data matching techniques to ... Groups/QUES… · Hot Fuzz: using SAS “fuzzy”data matching techniques to identify duplicate and related customer

Embed Size (px)

Citation preview

Hot Fuzz: using SAS “fuzzy” data matching

techniques to identify duplicate and

related customer records.

Stuart Edwards, Collection House Group

• Debt Collection; purchased debt, receivables and outsourced collections.

• Banking and Finance, Insurance, Government, Telecom and Utility debt.

• More than 250,000 purchased accounts currently active; aggregate face

value exceeding $5 Billion.

• Call-centre based; CSOs locate customers and negotiate an outcome.

Page 2

Collection House Group; what do we do?

Page 3

Skip tracing; the art of finding people.

• Debt usually sold / outsourced as the customer has defaulted and “skipped”.

• New “Leads” must be identified and worked to contact the customer.

• Leads data can be gathered from multiple providers.

• Our database of existing or previous customers is a free source of leads

• Cross-referencing is employed to search this data for linkages.

? +

Page 4

Why do we perform cross-referencing?

• Many customers have multiple debts.

• Makes skip-tracing more efficient and more timely.

• Fuzzy-matching uncovers non-standard matches – e.g. name changes (due

to marriage), mis-spellings and co-habitants.

• Linking accounts enables consistent negotiation.

• Cross-referencing results may indicate a customer’s propensity to pay.

Customer 1Customer 2

Loan

Account 2

Motor Loan

Account 7Telco

Account 4

Credit Card

Account 1

Utility

Account 3PrimaryCo-borrower

PrimaryGuarantor07 1234 5678

<Unknown>

Page 5

What attributes help us to determine a match?

Name: J Smith

DOB: 12/05/1968

10 Queen St, Brisbane

Name: John Smith

DOB: 05/12/1968

4 Merthyr Rd, New Farm

?• Need to identify both perfect and potential matches.

• Input data can be of varying quality!

• There are a number of attributes upon which we can match.

First NameMiddle Name(s)Surname

Maiden NameMonikersInitials

AddressDPIDX/Y Coordinates

Drivers LicenceCredit Bureau Reference

Telephone number(s)Email address

Date of BirthInverted DD / MM

SAS offers a range of methods for matching data.

Page 6

Method Benefits Disadvantages

data c1;

merge a1 b1;

by <Common Variable>;

Data step match-merge. Simple, Base SAS syntax.

Fast execution.

Not suited to many-to-

many matches.

proc sql;

create table c1 as

select a1.*, b1.*

from a1 left join b1 on

(<Common Variable>);

Proc SQL join. Generates a Cartesian

product; good for many-

to-many joins.

SAS Data Flux. Optimised for this task.

Matching logic in-built.

More expensive than our

budget for the features

we required.

data c1;

declare Hash HX (dataset:

"Match_Data"

.......etc

Hash table matching. In memory execution.

Enables matching on

multiple key variables.

Slow when merging large

data sets.

Can fill up disk space if

not optimised.

Complex syntax to write

and debug.

Can fill up memory with

large data sets.

Page 7

Hash matching was chosen as the preferred

strategy.*Match Base_Data to Match_Data via Hash Table;

data Match_Output;

*Declare the Hash key;

declare Hash HX (dataset: "Match_Data", multidata: 'y',

hashexp: 12);

rc = HX.DefineKey ("KEY1");

rc = HX.DefineData (<Match Variables to Write>);

rc = HX.DefineDone ();

*Populate the hash table;

do until (eof_hash);

set Match_Data end = eof_hash;

rc = HX.add ();

end;

*Loop through the records in the base data set;

do until (eofX) ;

set Base_Data end = eofX;

rc = HX.find ();

if (rc = 0) then do;

if <Crude Filter Conditions> then output;

HX.has_next(result: r);

do while(r ne 0);

rc = HX.find_next();

if <Crude Filter Conditions> then output;

HX.has_next(result: r);

end;

end;

end;

stop;

run;

• Many-to-many matching is required.

• Data volumes are large – 1M+ records

matched against 1M+ records.

• In memory processing – reliable with

acceptable speed.

• Matching is performed on common

variables.

• Crude filters applied to eliminate obvious

non-matches.

Page 8

Multiple keys are generated from transformed attributes to

perform the hash matches.

• The Soundex function is used to give similar

sounding names a “fuzzy” value.

• Multiple key values are combined.

• The MD5 function generates a Hash Key.

• Hash key is read into memory and records are

matched on the key.

• Multiple hash results combined by SQL Union.

-

SX_Name = soundex(<Name>);

Key_Name = put(md5(Key_Cat),$hex8.);

rc = HX.DefineKey (“Key_Name");

proc sql;

create table Hash_Combined as

select * from Hash_Result1

UNION

select * from Hash_Result2

;

quit;

JOHN-PAUL

JONNY PAOLOJ514

Key_Cat = catt(SX_Name,SX_Surname);

J514T2435253 BC590E6B

Page 9

Filtering is undertaken to remove unlikely match pairs.

• Crude filtering applied to exclude obvious non-matches.

• PAF data is used to cleanse and validate addresses.

o DPID (premises-specific delivery ID) used in Key generation.

o X/Y Coordinates, to calculate geographical distance.

ID ID_X Key Key Type Name Name_X Address Address_X Distance Score Match Type

46873 46132 1FGH8 Surname +

Initial

Robert

Jones

R Jones 1 King St,

Sydney

1 King St,

Sydney

46873 45944 2J9KX Name + DOB Robert

Jones

Robbie

Jones

1 King St,

Sydney

5 New Rd,

Pyrmont

46873 40185 J6R7F Name + DPID Robert

Jones

Mary

Jones

1 King St,

Sydney

1 King St,

Sydney

46873 42922 L13A9 Name Robert

Jones

Bob

Jones

1 King St,

Sydney

18 Old St,

Bondi

0 km

2.1 km

0 km

6.2 km

OHN D

Page 10

Enterprise Miner is used to derive a propensity-to-

match score using available points of ID.

• Customer records with many points of ID are

used to seed a model.

• Credit Bureau data verifies the match and

sets the Target variable (match or no-

match).

• Points of ID are selectively removed.

• Hash matching conducted on sampled

data.

• Propensity-to-match model is built on the

validated target variable.

J

28/05/1975

0412 345 678

5 Wombat StKangaroo PtQLD 4069

QLD1678904

QLD 4075

7 Echidna RdSherwood

SMITH

Page 11

Enterprise Miner is used to derive a propensity-to-

match score using available points of ID.

• Results of model nodes determine the

worth of input variables.

• Models are compared and the best is used

to generate score code.

• Score logic applied to records as given to

deliver a match likelihood score.

ID ID_X Key Key Type Name Name_X Address Address_X Distance Score Match Type

46873 46132 1FGH8 Surname +

Initial

Robert

Jones

R Jones 1 King St,

Sydney

1 King St,

Sydney

0 km

46873 45944 2J9KX Name + DOB Robert

Jones

Robbie

Jones

1 King St,

Sydney

5 New Rd,

Pyrmont

2.1 km

46873 40185 J6R7F Name + DPID Robert

Jones

Mary

Jones

1 King St,

Sydney

1 King St,

Sydney

0 km

46873 42922 L13A9 Name Robert

JonesBob

Jones1 King St,

Sydney18 Old St,

Bondi6.2 km

920

615

276

102

EXACT

PROBABLE

FAMILY

POSSIBLE

Page 12

Filtered results are displayed to our Collections Agents

through the CRM.

• Pertinent information is displayed to CSOs for assessment.

Type: Account: 46021358721

Name: ROBERT JONES DOB: 12/04/1977

Address: 1 KING ST, SYDNEY, NSW 2000 D/L: N/A

Home Ph: 02 2134 5768 Work Ph: N/A Mobile Ph: N/A

Client: BIG FINANCE (BF1234) Product: CREDIT CARD

Balance: $5,241 Last Paid: - Last Contact: 15/03/2015

Current Intention: Located, no active commitment.

Type: Account: 46021358721

Name: ROBBIE JONES DOB: 12/04/1977

Address: 5 NEW RD, PYRMONT, NSW 2006 D/L: NSW268427

Home Ph: N/A Work Ph: N/A Mobile Ph: 0411 223 344

Client: POWER2 (PT5678) Product: ELECTRICITY SUPPLY

Balance: $0 Last Paid: 12/01/2015 Last Contact: 12/01/2015

Current Intention: Account Paid in Full.

Page 13

Project successes and lessons learned.

• Time for more than 20 FTE saved by eliminating manual XREF.

• New cross-reference approach identified new match pairs.

• Filtering to balance between too-many and too-few “possible” matches.

• Performing many-to-many matches on large datasets can be resource

hungry and result in crashes.

• Hash matching is the most reliable solution.

• Enterprise Miner derived score models allowed more qualitative assessment.

Page 14

References to useful Hash Table resources.

• SAS Hash Object Programming Made Easy.

Michele M. Burlew (SAS Press, 2012)).

• Think FAST! Use Memory Tables (Hashing) for Faster Merging.

SUGI 31,Paper 244-31 (Gregg Snell, Data Savant Consulting, KS)

• An Introduction to Hash Tables.

SAS Canada User Groups, 2013 (Shaun Kaufmann, Farm Credit Canada).

• Table Look-Up by Direct Addressing: Key-Indexing – Bitmapping – Hashing.

SUGI 26,Paper 8-26 (Paul M. Dorfman, CitiCorp AT&T Universal Card, Jacksonville, FL)

• Choosing the Right Technique to Merge Large Data Sets Efficiently.

SUGI 26,Paper 071-2009 (Qingfeng Liang, Community Care Behavioral Health

Organization, Pittsburgh, PA)

Page 15

Questions?

Stuart Edwards

Head of Analytics, Collection House Group

[email protected]