Assurance Scoring: Using Machine Learning and Analytics to Reduce Risk in the Public Sector
Matt Thomson, Natalia Angarita-Jaimes
8/5/2016
Copyright © Capgemini 2012. All Rights Reserved
Outline
• Introduction
• Traditional Fraud Detection
• Assurance Scoring
• Machine Learning
• Business Rules
• Anomaly Detection
• Graph Links
Who are we?
Matt Thomson
• Senior Data Scientist at Capgemini
• PhD in Astrophysics (http://arxiv.org/abs/1010.3315)
• Several years' experience in fraud detection
Natalia Angarita-Jaimes
• Data Scientist at Capgemini
• PhD in Optical Engineering
• Several years' experience in signal and image processing
Capgemini Big Data Analytics team
• 30 Data Scientists, 40 Big Data Engineers
• Focus on Open Source and Big Data technologies to solve client problems
• Sponsor the conference!
Introduction to the Problem
Public sector constantly working in an environment of reduced resources
Want to provide a better service but with greater efficiency
Therefore very important that limited resources are focussed correctly
Assurance Scoring: use ML and other analytical methods to identify the least risky people or applications, so that investigators' resources can be targeted on the most risky
Hypothetical Example – 2016 Olympics tickets
Running the application process for selling tickets to the 2016 Olympics
• Avoid selling tickets to touts/resellers
• Vast majority of people applying for tickets are genuine
• Fraud detection with a big class imbalance problem (<0.1%)
• Avoid the approach of investigating each person applying
Let's say we know from the 2012 Olympics which people ended up reselling their tickets: this is our training data
Use ML to identify the least risky 30% (say) of people wanting tickets
Investigators focus on the high risk
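As a rough illustration of the last two points, the sketch below ranks applicants by a model's risk score and sets aside the least risky 30%. The function name `split_by_risk`, the applicant IDs and the scores are all hypothetical; the deck does not show how scores are produced or cut.

```python
def split_by_risk(scores, low_risk_fraction=0.30):
    """Split applicant IDs into (low_risk, to_review).

    `scores` maps applicant ID -> risk score, where a higher score
    means the model thinks reselling is more likely (an assumption).
    """
    ranked = sorted(scores, key=scores.get)          # lowest risk first
    cutoff = round(len(ranked) * low_risk_fraction)  # size of the low-risk bucket
    return ranked[:cutoff], ranked[cutoff:]


# Hypothetical scores for ten applicants.
scores = {
    "a1": 0.02, "a2": 0.91, "a3": 0.10, "a4": 0.40, "a5": 0.05,
    "a6": 0.33, "a7": 0.75, "a8": 0.01, "a9": 0.22, "a10": 0.60,
}
low_risk, to_review = split_by_risk(scores)
```

Investigators would then work only through `to_review`; the 30% figure is the deck's own "say" value and would be tuned in practice.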
Traditional Fraud Detection
Identify Historical Training Data
Feature Engineering
Model Training and Evaluation
Model Execution
Feedback
Assurance Scoring
Focus on low-risk
Allows resources to be better focussed
Not limited to Machine Learning
Built using Python! Pandas, scikit-learn, etc.
Assurance Scoring
POLE ‘Analytical’ Data Layer
Disparate data sources - Atomic Layer
Atomic data is Transformed and Loaded into POLE
POLE Layer
Person, Object, Location, Event
POLE ‘Analytical’ Data Layer
POLE contains ALL entities from the Atomic Layer, plus their inter-linkages
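One way to picture a layer that holds entities plus their inter-linkages is a small in-memory graph. This is only a sketch under assumptions: the class names `Entity` and `PoleLayer`, the methods `add_link` and `neighbours`, and the example keys are hypothetical and are not taken from the system described in the deck.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    kind: str  # one of "Person", "Object", "Location", "Event"
    key: str   # identifier carried over from the atomic layer (hypothetical)


@dataclass
class PoleLayer:
    entities: set = field(default_factory=set)
    links: set = field(default_factory=set)  # each link is a frozenset of two entities

    def add_link(self, a, b):
        """Register both entities and an undirected link between them."""
        self.entities.update((a, b))
        self.links.add(frozenset((a, b)))

    def neighbours(self, entity):
        """Return every entity directly linked to `entity`."""
        return {
            other
            for link in self.links if entity in link
            for other in link if other != entity
        }


# Hypothetical Olympics example: one application event and what it touches.
person = Entity("Person", "applicant-1")
application = Entity("Event", "ticket-application-1")
address = Entity("Location", "postcode-AB1")
card = Entity("Object", "payment-card-1")

pole = PoleLayer()
pole.add_link(person, application)
pole.add_link(application, address)
pole.add_link(application, card)
```

Here `pole.neighbours(application)` returns the person, address and card tied to that application, which is the kind of linkage the later scoring steps can query.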
Assurance Scoring
Machine learning
[Diagram: machine learning framework]
Input Data
Vector Build
Manipulate, Explore Data
Feature extraction and selection: Transform, Selection
Model Building: Model (Training, Validation, Test)
Variety of output files: logs, graphics, pickle models, etc.
Testing: unit tests, monitoring tests and integration tests
Framework: structure, flexibility, consistency
Machine learning : Feature Engineering
SQL, Python
Transform
Explore
Select
Ask questions, validate
Refine features
• Feature Extraction
• Data exploration
• Feature selection
Historical Data
Machine Learning: Model Building
Training
Validation
Test
Split Datasets
Build Models
Hyper-parameter tuning
Selected features
Models
Training results
Validation results
Test results
Compare Models
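With fraud rates as low as those quoted earlier, the "Split Datasets" step probably needs to preserve the class ratio in each split. The following is a standard-library sketch of a stratified train/validation/test split; the function name `stratified_split`, the 60/20/20 fractions and the toy labels are assumptions, and in a scikit-learn workflow `train_test_split` with its `stratify` argument would normally do this job.

```python
import random


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split row indices into train/validation/test, keeping the class ratio."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_valid = round(len(indices) * fractions[1])
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]  # whatever is left over
    return train, valid, test


# Toy data: 10 known resellers (label 1) among 1,000 applications.
labels = [1] * 10 + [0] * 990
train, valid, test = stratified_split(labels)
```

Hyper-parameters would then be tuned against the validation split only, leaving the test split untouched until the final model comparison.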
Low risk? High risk? Depends on classifier’s threshold
• True positives: applications the model correctly classifies as high risk
• True negatives: applications the model correctly classifies as low risk
• False positives: applications the model scores as high risk but that are not
• False negatives: applications the model scores as low risk but that were in fact high risk
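To make the dependence on the threshold concrete, the sketch below counts the four outcomes for a given cut-off. The function name `confusion_counts`, the scores and the labels are hypothetical; treating a score at or above the threshold as "high risk" is also an assumption.

```python
def confusion_counts(scores, is_high_risk, threshold):
    """Count TP/FP/TN/FN when score >= threshold is classed as high risk."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, truth in zip(scores, is_high_risk):
        predicted_high = score >= threshold
        if predicted_high and truth:
            counts["tp"] += 1
        elif predicted_high and not truth:
            counts["fp"] += 1
        elif not predicted_high and not truth:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Hypothetical model scores and known outcomes for six applications.
scores = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
is_high_risk = [False, False, True, False, True, True]

at_half = confusion_counts(scores, is_high_risk, threshold=0.5)
at_lower = confusion_counts(scores, is_high_risk, threshold=0.3)
```

Lowering the threshold from 0.5 to 0.3 removes the one false negative in this toy data, at the cost of sending more cases to investigators on real data.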
Assurance Scoring
Business Rules
Identifying fraud has often been done using deterministic rules
Look for transactions near a threshold or at the end of the day
Primarily data queries on your feature vector
Olympics example – anyone applying for more than £10,000 worth of tickets
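Since rules are described as queries on the feature vector, one could sketch them as named predicates over each row. The rule name, the field names `applicant_id` and `order_value_gbp`, and the sample rows are hypothetical; only the £10,000 limit comes from the slide.

```python
ORDER_VALUE_LIMIT_GBP = 10_000  # limit quoted in the Olympics example

# Each rule is a name plus a predicate over one feature-vector row.
RULES = {
    "order_value_over_limit": lambda row: row["order_value_gbp"] > ORDER_VALUE_LIMIT_GBP,
}


def apply_rules(rows, rules=RULES):
    """Return {applicant_id: [names of rules that fired]} for every row."""
    return {
        row["applicant_id"]: [name for name, rule in rules.items() if rule(row)]
        for row in rows
    }


# Hypothetical feature-vector rows.
applications = [
    {"applicant_id": "a1", "order_value_gbp": 250},
    {"applicant_id": "a2", "order_value_gbp": 12_500},
    {"applicant_id": "a3", "order_value_gbp": 10_000},
]
hits = apply_rules(applications)
high_risk = [applicant for applicant, fired in hits.items() if fired]
```

Keeping the fired rule names, rather than a bare flag, gives investigators a reason for each referral.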
Assurance Scoring
Anomaly Detection
Use the training data to create a baseline of applications by postcode (say)
If a particular postcode has a larger than expected number of applications, then those cases are pushed into the high-risk bucket
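A minimal sketch of this baseline-and-compare idea follows. The deck does not say how "larger than expected" is decided, so the test used here (observed count above the expected count plus three times its square root, a rough Poisson-style margin) is an assumption, as are the function name `anomalous_postcodes` and the postcodes.

```python
import math
from collections import Counter


def anomalous_postcodes(baseline_postcodes, current_postcodes, z=3.0):
    """Flag postcodes whose current count is well above the baseline expectation."""
    baseline = Counter(baseline_postcodes)
    current = Counter(current_postcodes)
    # Rescale the baseline so overall growth in volume is not flagged by itself.
    scale = len(current_postcodes) / len(baseline_postcodes)

    flagged = set()
    for postcode, observed in current.items():
        # Floor of 1 so postcodes unseen in training still get a margin.
        expected = max(baseline.get(postcode, 0) * scale, 1.0)
        if observed > expected + z * math.sqrt(expected):
            flagged.add(postcode)
    return flagged


# Hypothetical data: one postcode per application.
baseline_postcodes = ["AB1"] * 100 + ["CD2"] * 100 + ["EF3"] * 100 + ["GH4"] * 100
current_postcodes = ["AB1"] * 100 + ["CD2"] * 100 + ["EF3"] * 100 + ["GH4"] * 180
flagged = anomalous_postcodes(baseline_postcodes, current_postcodes)
```

Applications from any postcode in `flagged` would be moved to the high-risk bucket regardless of their model score.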
Assurance Scoring
Graph Links - Matching
Key part of assurance scoring – bringing data together from disparate sources
Probability of Match: 80%
Attribute        Data Source 1    Data Source 2
Name             Matt Thomson     Matthew Thosmon
Phone Number     07123 456 789    07123 456 798
Favourite Sport  Football         Cricket
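The matching method behind the 80% figure is not described, so the sketch below only illustrates the general idea: score each attribute with a string similarity and combine the scores with weights. It uses `difflib.SequenceMatcher` from the standard library; the weights, field names and function names are assumptions, and the result is an uncalibrated score rather than a true probability.

```python
from difflib import SequenceMatcher

# Hypothetical weights: name and phone are treated as far more
# identifying than a favourite sport.
WEIGHTS = {"name": 0.45, "phone": 0.45, "favourite_sport": 0.10}


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(record_1, record_2, weights=WEIGHTS):
    """Weighted average of per-attribute similarities."""
    return sum(
        weight * similarity(record_1[attr], record_2[attr])
        for attr, weight in weights.items()
    )


source_1 = {"name": "Matt Thomson", "phone": "07123 456 789", "favourite_sport": "Football"}
source_2 = {"name": "Matthew Thosmon", "phone": "07123 456 798", "favourite_sport": "Cricket"}

score = match_score(source_1, source_2)
```

Pairs scoring above a chosen cut-off would be linked as the same person in the graph; that cut-off has to be set against known matches.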
Assurance Scoring
Jupyter Notebook
Further Details
[email protected] / @MattGThomson
Assurance Scoring brochure: http://ow.ly/4nbEUI
Blogs:
Introduction: https://www.capgemini.com/node/1380596
Integrating multiple techniques: http://bit.ly/24BmszV
Machine Learning: http://bit.ly/1QTMGnq
More coming soon!
We’re Hiring!
Data Science: https://www.uk.capgemini.com/careers/jobs/data-scientist-0
Big Data Engineer: https://www.uk.capgemini.com/careers/jobs/big-data-engineer
Data Visualisation Analyst: https://www.uk.capgemini.com/careers/jobs/data-visualisation-analyst
The information contained in this presentation is proprietary. © 2012 Capgemini. All rights reserved.
www.capgemini.com
About Capgemini
With more than 120,000 people in 40 countries, Capgemini is one of the world's foremost providers of consulting, technology and outsourcing services. The Group reported 2011 global revenues of EUR 9.7 billion.
Together with its clients, Capgemini creates and delivers business and technology solutions that fit their needs and drive the results they want. A deeply multicultural organization, Capgemini has developed its own way of working, the Collaborative Business Experience™, and draws on Rightshore®, its worldwide delivery model.
Rightshore® is a trademark belonging to Capgemini