Healthcare fraud detection

FRAUD DETECTION: BIG DATA ANALYSIS (HEALTHCARE APPLICATION)

MAHDI ESMAILOGHLI, BIGDATA.CEIT.AUT.AC.IR

WHAT IS FRAUD?

“… any illegal act characterized by deceit, concealment, or violation of trust. These acts are not dependent upon the threat of violence or physical force. Frauds are perpetrated by parties and organizations to obtain money, property, or services; to avoid payment or loss of services; or to secure personal or business advantage.”

International Professional Practices Framework (IPPF)

DEFINITION

FRAUD: WHERE COULD IT BE FOUND…

DOMAIN OF APPLICATION

WHERE CAN FRAUD BE FOUND?

▸ Healthcare systems

▸ Credit cards

▸ Social networks

▸ Satellite or military control systems

▸ …

FRAUD IN HEALTHCARE

DIFFERENCES

WHAT ARE THE CHARACTERISTICS OF HEALTHCARE DOMAIN DATA?

▸ The complexity and the number of fields in this kind of data are tremendous.

▸ The people and organizations involved intend to make a profit from others.

▸ The data is really big and sometimes arrives as a stream

▸ Many kinds of data: images, raw text, sound, …

▸ The data are not labeled and are hard to classify

▸ Concept drift

SOME TIPS ABOUT THE IMPORTANCE OF BIG DATA IN HEALTHCARE

TIPS

ROLE OF BIG DATA IN HEALTHCARE

▸ DNA data is one of the most important public datasets on Amazon.

▸ Stanford’s Big Data conference is all about healthcare

▸ Microsoft has established an academic division to work on healthcare

▸ Many countries lose money because of fraud in healthcare (up to 10% of annual US healthcare expenditure)

EXAMPLES

SOME FRAUDS THAT TRADITIONAL HEALTHCARE SYSTEMS HAVE HAD TO FACE

▸ Altering a patient’s insurance identification document

▸ A doctor always prescribing the same fixed brands of drugs

▸ Prescribing more expensive drugs than is usual for the same disease

▸ A patient obtaining certain kinds of drugs more often than usual

▸ and many more…

METHODS OF FRAUD DETECTION IN HEALTHCARE

SOLUTIONS

DETECTING HEALTHCARE FRAUD

▸ Statistical

▸ Machine learning and Data mining

▸ Graph analysis

FRAUD DETECTION USING STATISTICAL METHODS

STATISTICAL METHODS

STATISTICAL METHODS…

▸ Uses some rules

▸ Rules are described by a domain expert

▸ Build an application that computes initial statistical parameters, e.g.:

▸ Average number of drugs per prescription

▸ Total cost per disease

▸ These parameters are then compared with new data. If a large difference is found, the alarm goes off
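
The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Prescription` record, the sample history, and the threshold of 3 standard deviations (`max_sigma`) are hypothetical and would normally be chosen by the domain expert.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Prescription:
    """Hypothetical record: one prescription and the drugs it lists."""
    doctor_id: str
    drugs: list


def build_baseline(history):
    """Mean and standard deviation of the number of drugs per prescription."""
    counts = [len(p.drugs) for p in history]
    return mean(counts), pstdev(counts)


def is_suspicious(prescription, baseline, max_sigma=3.0):
    """Return True when the drug count deviates strongly from the baseline.

    max_sigma is an assumed threshold that a domain expert would tune.
    """
    avg, std = baseline
    deviation = abs(len(prescription.drugs) - avg)
    if std == 0:
        return deviation > 0
    return deviation / std > max_sigma


# Illustrative history with 2, 3, 2, 3 and 2 drugs per prescription.
history = [
    Prescription("dr-1", ["a", "b"]),
    Prescription("dr-1", ["a", "b", "c"]),
    Prescription("dr-2", ["a", "c"]),
    Prescription("dr-2", ["b", "c", "d"]),
    Prescription("dr-3", ["a", "d"]),
]
baseline = build_baseline(history)

new_prescription = Prescription("dr-4", list("abcdefghi"))  # 9 drugs
if is_suspicious(new_prescription, baseline):
    print("ALARM: unusual number of drugs for", new_prescription.doctor_id)
```

Because the baseline is just two numbers, each new record can be checked as it arrives, which is why this style of rule suits stream data.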

STATISTICAL METHODS

PROS AND CONS

▸ Very simple and easy to implement

▸ Low computational overhead

▸ Very easy to use on stream data

▸ Low flexibility

▸ Cannot cope with concept drift in the data

▸ Adding rules is hard

▸ Everything is based on the domain expert’s knowledge

▸ The defined set of rules may not be complete

FRAUD DETECTION USING MACHINE LEARNING AND DATA MINING ALGORITHMS

MACHINE LEARNING AND DATA MINING ALGORITHMS

MACHINE LEARNING ALGORITHMS

▸ Choose one or more machine learning algorithms based on the data

▸ Use them to learn from the data and detect fraud

▸ If the data are labeled, classification is a good choice

▸ Otherwise, use clustering

▸ Or use clustering to label the data and then apply classification
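
The last option (cluster first, then classify) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the two features per provider, the plain k-means implementation, the rule that the smaller cluster is the suspicious one, and the 1-nearest-neighbour classifier. In practice the features would also be scaled before distances are computed.

```python
import math


def kmeans(points, k=2, iterations=20):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment


def label_clusters(assignment):
    """Assumption: the smaller cluster is the 'suspicious' one."""
    sizes = {c: assignment.count(c) for c in set(assignment)}
    rare = min(sizes, key=sizes.get)
    return ["suspicious" if a == rare else "normal" for a in assignment]


def classify(sample, points, labels):
    """1-nearest-neighbour classifier trained on the cluster-derived labels."""
    nearest = min(range(len(points)), key=lambda i: math.dist(sample, points[i]))
    return labels[nearest]


# Hypothetical features per provider: (drugs per prescription, cost per visit).
points = [(2, 40), (3, 45), (2, 50), (3, 42), (2, 48), (9, 400), (10, 380)]
labels = label_clusters(kmeans(points))

print(classify((8, 350), points, labels))
print(classify((3, 44), points, labels))
```

Clustering supplies the labels that the data lacks, and the classifier then scores new records without re-running the clustering.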

FRAUD DETECTION USING GRAPH ANALYSIS

GRAPH BASED FRAUD DETECTION

GRAPH ANALYSIS

▸ It has been gaining popularity since 2015

▸ It is still only an auxiliary system used alongside machine learning algorithms

▸ It cannot consider all aspects of fraud

▸ But it is handy

USING PAGE RANK FOR HEALTHCARE FRAUD DETECTION

HORTONWORKS

USING PAGE RANK FOR HEALTHCARE FRAUD DETECTION

DATA FIELDS

▸ NPI (National Provider Identifier)

▸ Speciality

▸ Procedure Code

▸ Count

PAGE RANK AND PERSONALIZED PAGE RANK

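EXAMPLE

PAGE RANK ON DATA

[Figure: provider graph with a Page Rank score on each of the eight nodes (13.5%, 13.5%, 9.5%, 13.3%, 17.6%, 9.5%, 13.7%, 9.5%). Node types: Dermatologist, Surgeon, Internist]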

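EXAMPLE

PERSONALIZED PAGE RANK ON DERMATOLOGIST SPECIALITY

[Figure: the same provider graph with Personalized Page Rank scores on each of the eight nodes (24.1%, 24.1%, 18.7%, 15.4%, 7.9%, 2.9%, 4.1%, 2.9%)]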


GRAPH BASED FRAUD DETECTION

ENVIRONMENT OF THE PAPER

▸ Dataset: CMS Medicare Part B

▸ Tools: Apache Hadoop and Apache Pig

▸ 8 nodes

▸ 4 cores per node

▸ 64 GB of memory per node

▸ Total execution time: 3 hours

STEPS OF THE ALGORITHM

Step 1

COMPUTE THE SIMILARITY BETWEEN PROVIDERS

▸ Compute similarities between providers based on the procedures they share

▸ If the similarity of two providers exceeds a threshold, an edge connects them

▸ Locality-Sensitive Hashing (LSH) and DIMSUM could help, but they were not used

▸ 880K providers => about 774 billion similarity computations (880K × 880K)

▸ My dataset: ~140 providers => about 20K similarity computations
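
A minimal sketch of this step follows. The slides do not say which similarity measure was used, so cosine similarity over per-provider procedure counts is an assumption here, as are the threshold of 0.8, the provider identifiers, and the procedure codes.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two sparse {procedure_code: count} vectors."""
    dot = sum(count * b[code] for code, count in a.items() if code in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(providers, threshold=0.8):
    """Connect every pair of providers whose similarity exceeds the threshold."""
    edges = []
    for (npi_a, vec_a), (npi_b, vec_b) in combinations(providers.items(), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim > threshold:
            edges.append((npi_a, npi_b, sim))
    return edges


# Hypothetical providers: NPI -> {procedure code: count}. Codes are made up.
providers = {
    "npi-1": {"P100": 10, "P200": 5},
    "npi-2": {"P100": 8, "P200": 6},
    "npi-3": {"P900": 12},
}
edges = build_edges(providers)
print(edges)
```

Since the measure is symmetric, the sketch visits each unordered pair once, that is n(n-1)/2 pairs. The counts quoted on the slide correspond to n × n (880K × 880K is roughly 774 billion), so either way the cost grows quadratically with the number of providers.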

STEPS OF THE ALGORITHM

Step 2

COMPUTING PERSONALIZED PAGE RANK FOR EACH SPECIALITY

▸ Loop over all specialities

▸ For each speciality apply Personalized Page Rank to the graph

▸ Identify anomalous providers: nodes whose PR_speciality(node) is high but whose speciality is not the one used for the Page Rank calculation
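
The step can be illustrated with a small power-iteration sketch of Personalized Page Rank in plain Python. It is not the GraphX or Pig code behind the slides: the toy graph, the damping factor of 0.85, and the score threshold `min_score` are assumptions made for the example. The function handles one speciality; the full method repeats it for every speciality.

```python
def personalized_pagerank(neighbors, seeds, damping=0.85, iterations=100):
    """Power iteration where teleportation returns only to the seed nodes."""
    nodes = list(neighbors)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = neighbors[n]
            if not out:
                # Dangling node: send its mass back to the seeds.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * teleport[m]
                continue
            share = damping * rank[n] / len(out)
            for m in out:
                new_rank[m] += share
        rank = new_rank
    return rank


def find_anomalies(neighbors, speciality, target, min_score=0.1):
    """Flag providers that score high for `target` but declare another speciality."""
    seeds = {n for n, s in speciality.items() if s == target}
    rank = personalized_pagerank(neighbors, seeds)
    return sorted(n for n, r in rank.items() if r > min_score and n not in seeds)


# Toy similarity graph (undirected, stored as adjacency lists).
neighbors = {
    "d1": ["d2", "d3", "x1"],
    "d2": ["d1", "d3", "x1"],
    "d3": ["d1", "d2"],
    "x1": ["d1", "d2", "s1"],
    "s1": ["x1", "s2"],
    "s2": ["s1"],
}
speciality = {
    "d1": "Dermatologist", "d2": "Dermatologist", "d3": "Dermatologist",
    "x1": "Internist", "s1": "Surgeon", "s2": "Surgeon",
}
print(find_anomalies(neighbors, speciality, "Dermatologist"))
```

In this toy graph the internist `x1` is connected mostly to dermatologists, so it picks up a high dermatology score despite its declared speciality, which is the kind of mismatch the step looks for.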


IMPLEMENTATION ON APACHE SPARK

OUR ANALYSIS

SPARK IMPLEMENTATION

WHAT WE DID IN SPARK

▸ Implemented from scratch

▸ Modified the Page Rank algorithm in Spark GraphX

▸ Each Personalized Page Rank run performs 100 iterations

▸ The dataset contains 20,000 records

▸ Running the algorithm took 20 minutes on a quad-core Core i7 MacBook Pro with 4 GB of memory (most of which was occupied by the OS)

SOME RESULTS OF FRAUD DETECTION

RESULTS

ALGORITHM SPEED ANALYSIS

SPEED ANALYSIS BASED ON ITERATION COUNT

[Chart: run time by iteration count; y-axis from 0 to 700, unit not stated]

Iterations:   10    25    50    75    100
Run time:     68    120   249   462   690

SOLUTION ANALYSIS

SOLUTION ANALYSIS

CONS. AND PROS.

▸ The algorithm needs to compute the similarity of all pairs of providers.

▸ It considers only one aspect of fraud, so it is not complete

▸ It is slow and needs a huge amount of memory (because of the initial similarity computation): 2 GB of data needs 512 GB of RAM

▸ Hard to add new data and update the graph

▸ Step 2 is costly

▸ Rules need to be defined in order to use graph analysis (other papers)

SOLUTION ANALYSIS

CONS. AND PROS.

▸ Step 1 needs a shuffle => reduced performance

▸ Modeling as a graph => easy to understand and analyze

▸ A new approach to fraud detection that is still developing

▸ LSH could be used, but 100% accuracy was wanted
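
To make the LSH trade-off concrete, here is a sketch of MinHash, a signature scheme commonly paired with LSH. It approximates the Jaccard similarity of two sets of procedure codes instead of computing it exactly. The slides do not say which LSH variant was considered, so the choice of MinHash, the Jaccard measure, the 64 hash functions, and the procedure codes are all assumptions. Input sets are assumed to be non-empty.

```python
import hashlib


def minhash_signature(items, num_hashes=64):
    """Keep, for each seeded hash function, the minimum hash over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions approximates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity: shared items over all items."""
    return len(a & b) / len(a | b)


# Hypothetical sets of procedure codes billed by two providers.
procedures_a = {"P100", "P200", "P300", "P400"}
procedures_b = {"P100", "P200", "P300", "P500"}

sig_a = minhash_signature(procedures_a)
sig_b = minhash_signature(procedures_b)
print("exact:", exact_jaccard(procedures_a, procedures_b))
print("estimate:", estimated_jaccard(sig_a, sig_b))
```

The estimate is only approximately equal to the exact value, which is the accuracy loss the bullet refers to. A full LSH scheme would additionally split the signatures into bands so that only providers sharing a band are compared.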

SUGGESTIONS

FUTURE

▸ Using other centrality algorithms

▸ Using algorithms like community detection instead of clustering

▸ If we add patient data, we can do more (in a bipartite graph we can detect fraud by more popular providers).
