
download.e-bookshelf.de · Fraud Analytics Using Descriptive, Predictive, and Social Network Tech-niques: A Guide to Data Science for Fraud Detection by Bart Baesens, Veronique Van



Trim Size: 6in x 9in Verbeke ffirs.tex V1 - 09/01/2017 7:42am Page viii�



Profit-Driven Business Analytics


Wiley & SAS Business Series

The Wiley & SAS Business Series presents books that help senior-level managers with their critical management decisions.

Titles in the Wiley & SAS Business Series include:

Analytics: The Agile Way by Phil Simon

Analytics in a Big Data World: The Essential Guide to Data Science and its Applications by Bart Baesens

A Practical Guide to Analytics for Governments: Using Big Data for Good by Marie Lowman

Bank Fraud: Using Technology to Combat Losses by Revathi Subramanian

Big Data Analytics: Turning Big Data into Big Money by Frank Ohlhorst

Big Data, Big Innovation: Enabling Competitive Differentiation through Business Analytics by Evan Stubbs

Business Analytics for Customer Intelligence by Gert Laursen

Business Intelligence Applied: Implementing an Effective Information and Communications Technology Infrastructure by Michael Gendron

Business Intelligence and the Cloud: Strategic Implementation Guide by Michael S. Gendron

Business Transformation: A Roadmap for Maximizing Organizational Insights by Aiman Zeid

Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media by Frank Leistner

Data-Driven Healthcare: How Analytics and BI are Transforming the Industry by Laura Madsen

Delivering Business Analytics: Practical Guidelines for Best Practice by Evan Stubbs

Demand-Driven Forecasting: A Structured Approach to Forecasting, Second Edition by Charles Chase

Demand-Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain by Robert A. Davis


Developing Human Capital: Using Analytics to Plan and Optimize Your Learning and Development Investments by Gene Pease, Barbara Beresford, and Lew Walker

The Executive’s Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business by David Thomas and Mike Barlow

Economic and Business Forecasting: Analyzing and Interpreting Econometric Results by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah Watt, and Sam Bullard

Economic Modeling in the Post Great Recession Era: Incomplete Data, Imperfect Markets by John Silvia, Azhar Iqbal, and Sarah Watt House

Enhance Oil & Gas Exploration with Data-Driven Geophysical and Petrophysical Models by Keith Holdaway and Duncan Irving

Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications by Robert Rowan

Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection by Bart Baesens, Veronique Van Vlasselaer, and Wouter Verbeke

Harness Oil and Gas Big Data with Analytics: Optimize Exploration and Production with Data-Driven Models by Keith Holdaway

Health Analytics: Gaining the Insights to Transform Health Care by Jason Burke

Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World by Carlos Andre Reis Pinheiro and Fiona McNeill

Human Capital Analytics: How to Harness the Potential of Your Organization’s Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz-enz

Implement, Improve and Expand Your Statewide Longitudinal Data System: Creating a Culture of Data in Education by Jamie McQuiggan and Armistead Sapp

Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards, Second Edition, by Naeem Siddiqi

Killer Analytics: Top 20 Metrics Missing from your Balance Sheet by Mark Brown


Machine Learning for Marketers: Hold the Math by Jim Sterne

On-Camera Coach: Tools and Techniques for Business Professionals in a Video-Driven World by Karin Reed

Predictive Analytics for Human Resources by Jac Fitz-enz and John Mattox II

Predictive Business Analytics: Forward-Looking Capabilities to Improve Business Performance by Lawrence Maisel and Gary Cokins

Profit Driven Business Analytics: A Practitioner’s Guide to Transforming Big Data into Added Value by Wouter Verbeke, Cristian Bravo, and Bart Baesens

Retail Analytics: The Secret Weapon by Emmett Cox

Social Network Analysis in Telecommunications by Carlos Andre Reis Pinheiro

Statistical Thinking: Improving Business Performance, Second Edition by Roger W. Hoerl and Ronald D. Snee

Strategies in Biomedical Data Science: Driving Force for Innovation by Jay Etchings

Style & Statistic: The Art of Retail Analytics by Brittany Bullard

Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics by Bill Franks

Too Big to Ignore: The Business Case for Big Data by Phil Simon

The Analytic Hospitality Executive by Kelly A. McGuire

The Value of Business Analytics: Identifying the Path to Profitability by Evan Stubbs

The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions by Phil Simon

Using Big Data Analytics: Turning Big Data into Big Money by Jared Dean

Win with Advanced Business Analytics: Creating Business Value from Your Data by Jean Paul Isson and Jesse Harriott

For more information on any of the above titles, please visit www.wiley.com.


Profit-Driven Business Analytics

A Practitioner’s Guide to Transforming Big Data into Added Value

Wouter Verbeke
Bart Baesens
Cristián Bravo


Copyright © 2018 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Cataloging-in-Publication Data is Available:

ISBN 9781119286554 (Hardcover)
ISBN 9781119286998 (ePDF)
ISBN 9781119286981 (ePub)

Cover Design: Wiley
Cover Image: © Ricardo Reitmeyer/iStockphoto

Printed in the United States of America.

10 9 8 7 6 5 4 3 2 1


To Luit, Titus, and Fien.

To my wonderful wife, Katrien, and kids Ann-Sophie, Victor, and Hannelore.

To my parents and parents-in-law.

To Cindy, for her unwavering support.


Contents

Foreword

Acknowledgments

Chapter 1 A Value-Centric Perspective Towards Analytics
Introduction
Business Analytics
Profit-Driven Business Analytics
Analytics Process Model
Analytical Model Evaluation
Analytics Team
Profiles
Data Scientists
Conclusion
Review Questions
Multiple Choice Questions
Open Questions
References

Chapter 2 Analytical Techniques
Introduction
Data Preprocessing
Denormalizing Data for Analysis
Sampling
Exploratory Analysis
Missing Values
Outlier Detection and Handling
Principal Component Analysis
Types of Analytics
Predictive Analytics
Introduction
Linear Regression
Logistic Regression
Decision Trees
Neural Networks


Ensemble Methods
Bagging
Boosting
Random Forests
Evaluating Ensemble Methods
Evaluating Predictive Models
Splitting Up the Dataset
Performance Measures for Classification Models
Performance Measures for Regression Models
Other Performance Measures for Predictive Analytical Models
Descriptive Analytics
Introduction
Association Rules
Sequence Rules
Clustering
Survival Analysis
Introduction
Survival Analysis Measurements
Kaplan Meier Analysis
Parametric Survival Analysis
Proportional Hazards Regression
Extensions of Survival Analysis Models
Evaluating Survival Analysis Models
Social Network Analytics
Introduction
Social Network Definitions
Social Network Metrics
Social Network Learning
Relational Neighbor Classifier
Probabilistic Relational Neighbor Classifier
Relational Logistic Regression
Collective Inferencing
Conclusion
Review Questions
Multiple Choice Questions
Open Questions
Notes
References

Chapter 3 Business Applications
Introduction
Marketing Analytics
Introduction


RFM Analysis
Response Modeling
Churn Prediction
X-selling
Customer Segmentation
Customer Lifetime Value
Customer Journey
Recommender Systems
Fraud Analytics
Credit Risk Analytics
HR Analytics
Conclusion
Review Questions
Multiple Choice Questions
Open Questions
Note
References

Chapter 4 Uplift Modeling
Introduction
The Case for Uplift Modeling: Response Modeling
Effects of a Treatment
Experimental Design, Data Collection, and Data Preprocessing
Experimental Design
Campaign Measurement of Model Effectiveness
Uplift Modeling Methods
Two-Model Approach
Regression-Based Approaches
Tree-Based Approaches
Ensembles
Continuous or Ordered Outcomes
Evaluation of Uplift Models
Visual Evaluation Approaches
Performance Metrics
Practical Guidelines
Two-Step Approach for Developing Uplift Models
Implementations and Software
Conclusion
Review Questions
Multiple Choice Questions
Open Questions
Note
References


Chapter 5 Profit-Driven Analytical Techniques
Introduction
Profit-Driven Predictive Analytics
The Case for Profit-Driven Predictive Analytics
Cost Matrix
Cost-Sensitive Decision Making with Cost-Insensitive Classification Models
Cost-Sensitive Classification Framework
Cost-Sensitive Classification
Pre-Training Methods
During-Training Methods
Post-Training Methods
Evaluation of Cost-Sensitive Classification Models
Imbalanced Class Distribution
Implementations
Cost-Sensitive Regression
The Case for Profit-Driven Regression
Cost-Sensitive Learning for Regression
During Training Methods
Post-Training Methods
Profit-Driven Descriptive Analytics
Profit-Driven Segmentation
Profit-Driven Association Rules
Conclusion
Review Questions
Multiple Choice Questions
Open Questions
Notes
References

Chapter 6 Profit-Driven Model Evaluation and Implementation
Introduction
Profit-Driven Evaluation of Classification Models
Average Misclassification Cost
Cutoff Point Tuning
ROC Curve-Based Measures
Profit-Driven Evaluation with Observation-Dependent Costs
Profit-Driven Evaluation of Regression Models
Loss Functions and Error-Based Evaluation Measures
REC Curve and Surface
Conclusion


Review Questions
Multiple Choice Questions
Open Questions
Notes
References

Chapter 7 Economic Impact
Introduction
Economic Value of Big Data and Analytics
Total Cost of Ownership (TCO)
Return on Investment (ROI)
Profit-Driven Business Analytics
Key Economic Considerations
In-Sourcing versus Outsourcing
On Premise versus the Cloud
Open-Source versus Commercial Software
Improving the ROI of Big Data and Analytics
New Sources of Data
Data Quality
Management Support
Organizational Aspects
Cross-Fertilization
Conclusion
Review Questions
Multiple Choice Questions
Open Questions
Notes
References

About the Authors

Index


Foreword

Sandra Wilikens
Secretary General, responsible for CSR and member of the Executive Committee, BNP Paribas Fortis

In today’s corporate world, strategic priorities tend to center on customer and shareholder value. One of the consequences is that analytics often focuses too much on complex technologies and statistics rather than long-term value creation. With their book Profit-Driven Business Analytics, Verbeke, Bravo, and Baesens pertinently bring forward a much-needed shift of focus that consists of turning analytics into a mature, value-adding technology. It further builds on the extensive research and industry experience of the author team, making it a must-read for anyone using analytics to create value and gain sustainable strategic leverage. This is even more true as we enter a new era of sustainable value creation in which the pursuit of long-term value has to be driven by sustainably strong organizations. The role of corporate employers is evolving as civic involvement and social contribution grow to be key strategic pillars.


Acknowledgments

It is a great pleasure to acknowledge the contributions and assistance of various colleagues, friends, and fellow analytics lovers to the writing of this book. This book is the result of many years of research and teaching in business analytics. We first would like to thank our publisher, Wiley, for accepting our book proposal.

We are grateful to the active and lively business analytics community for providing various user fora, blogs, online lectures, and tutorials, which proved very helpful.

We would also like to acknowledge the direct and indirect contributions of the many colleagues, fellow professors, students, researchers, and friends with whom we collaborated during the past years. Specifically, we would like to thank Floris Devriendt and George Petrides for contributing to the chapters on uplift modeling and profit-driven analytical techniques.

Last but not least, we are grateful to our partners, parents, and families for their love, support, and encouragement.

We have tried to make this book as complete, accurate, and enjoyable as possible. Of course, what really matters is what you, the reader, think of it. Please let us know your views by getting in touch. The authors welcome all feedback and comments, so do not hesitate to let us know your thoughts!

Wouter Verbeke
Bart Baesens
Cristián Bravo
May 2017


CHAPTER 1

A Value-Centric Perspective Towards Analytics

INTRODUCTION

In this first chapter, we set the scene for what is ahead by broadly introducing profit-driven business analytics. The value-centric perspective toward analytics proposed in this book will be positioned and contrasted with a traditional statistical perspective. The implications of adopting a value-centric perspective toward the use of analytics in business are significant: a mind shift is needed both from managers and data scientists in developing, implementing, and operating analytical models. This, however, calls for deep insight into the underlying principles of advanced analytical approaches. Providing such insight is our general objective in writing this book and, more specifically:

◾ We aim to provide the reader with a structured overview of state-of-the-art analytics for business applications.

◾ We want to assist the reader in gaining a deeper practical understanding of the inner workings and underlying principles of these approaches from a practitioner’s perspective.


◾ We wish to advance managerial thinking on the use of advanced analytics by offering insight into how these approaches may either generate significant added value or lower operational costs by increasing the efficiency of business processes.

◾ We seek to promote and facilitate the use of analytical approaches that are customized to needs and requirements in a business context.

As such, we envision that our book will facilitate organizations stepping up to a next level in the adoption of analytics for decision making by embracing the advanced methods introduced in the subsequent chapters of this book. Doing so requires an investment in terms of acquiring and developing knowledge and skills but, as is demonstrated throughout the book, also generates increased profits. An interesting feature of the approaches discussed in this book is that they have often been developed at the intersection of academia and business, by academics and practitioners joining forces for tuning a multitude of approaches to the particular needs and problem characteristics encountered and shared across diverse business settings.

Most of these approaches emerged only after the millennium, which should not be surprising. Since the millennium, we have witnessed a continuous and accelerating development and an expanding adoption of information, network, and database technologies. Key technological evolutions include the massive growth and success of the World Wide Web and Internet services, the introduction of smartphones, the standardization of enterprise resource planning systems, and many other applications of information technology. This dramatic change of scene has fostered the development of analytics for business applications as a rapidly growing and thriving branch of science and industry.

To achieve the stated objectives, we have chosen to adopt a pragmatic approach in explaining techniques and concepts. We do not focus on providing extensive mathematical proof or detailed algorithms. Instead, we pinpoint the crucial insights and underlying reasoning, as well as the advantages and disadvantages, related to the practical use of the discussed approaches in a business setting. For this, we ground our discourse on solid academic research expertise as well as on many years of practical experience in developing industrial analytics projects in close collaboration with data science professionals. Throughout the book, a plethora of illustrative examples and case studies are discussed. Example datasets, code, and implementations are provided on the book’s companion website, www.profit-analytics.com, to further support the adoption of the discussed approaches.

In this chapter, we first introduce business analytics. Next, the profit-driven perspective toward business analytics that will be elaborated in this book is presented. We then introduce the subsequent chapters of this book and how the approaches introduced in these chapters allow us to adopt a value-centric approach for maximizing profitability and, as such, to increase the return on investment of big data and analytics. Next, the analytics process model is discussed, detailing the successive steps in carrying out an analytics project within an organization. Finally, the chapter concludes by characterizing the ideal profile of a business data scientist.

Business Analytics

“Data is the new oil” is a popular quote pinpointing the increasing value of data and, to our liking, accurately characterizes data as raw material. Data are to be seen as an input or basic resource needing further processing before actually being of use. In a subsequent section in this chapter, we introduce the analytics process model that describes the iterative chain of processing steps involved in turning data into information or decisions, which is quite similar actually to an oil refinery process. Note the subtle but significant difference between the words data and information in the sentence above. Whereas data fundamentally can be defined to be a sequence of zeroes and ones, information essentially is the same but implies in addition a certain utility or value to the end user or recipient. So, whether data are information depends on whether the data have utility to the recipient. Typically, for raw data to be information, the data first need to be processed, aggregated, summarized, and compared. In summary, data typically need to be analyzed, and insight, understanding, or knowledge should be added for data to become useful.

Applying basic operations on a dataset may already provide useful insight and support the end user or recipient in decision making. These basic operations mainly involve selection and aggregation. Both selection and aggregation may be performed in many ways, leading to a plenitude of indicators or statistics that can be distilled from raw data. The following illustration elaborates a number of sales indicators in a retail setting.

EXAMPLE

For managerial purposes, a retailer requires the development of real-time sales reports. Such a report may include a wide variety of indicators that summarize raw sales data. Raw sales data, in fact, concern transactional data that can be extracted from the online transaction processing (OLTP) system that is operated by the retailer. Some example indicators and the required selection and aggregation operations for calculating these statistics are:

◾ Total amount of revenues generated over the last 24 hours: Select all transactions over the last 24 hours and sum the paid amounts, with paid meaning the price net of promotional offers.

◾ Average paid amount in online store over the last seven days: Select all online transactions over the last seven days and calculate the average paid amount.

◾ Fraction of returning customers within one month: Select all transactions over the last month and select customer IDs that appear more than once; count these IDs and divide by the number of distinct customer IDs over the same month.

Note that calculating these indicators involves basic selection operations on characteristics or dimensions of transactions stored in the database, as well as basic aggregation operations such as sum, count, and average, among others.

Providing insight by customized reporting is exactly what the field of business intelligence (BI) is about. Typically, visualizations are also adopted to represent indicators and their evolution in time, in easy-to-interpret ways. Visualizations provide support by facilitating the user’s ability to acquire understanding and insight in the blink of an eye. Personalized dashboards, for instance, are widely adopted in the industry and are very popular with managers to monitor and keep track of business performance. A formal definition of business intelligence is provided by Gartner (http://www.gartner.com/it-glossary):

Business intelligence is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.

Note that this definition explicitly mentions the required infrastructure and best practices as an essential component of BI, which is typically also provided as part of the package or solution offered by BI vendors and consultants. More advanced analysis of data may further support users and optimize decision making. This is exactly where analytics comes into play. Analytics is a catch-all term covering a wide variety of what are essentially data-processing techniques.
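As a point of reference for the more advanced techniques, the basic selection and aggregation operations behind the three retail sales indicators in the example earlier in this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the retailer's actual reporting logic: the transaction records, the field names (customer_id, channel, paid, ts), and the reference time are all hypothetical, and the fraction of returning customers is taken as the number of customer IDs appearing more than once divided by the number of distinct customer IDs in the window.

```python
from datetime import datetime, timedelta

# Hypothetical transaction records standing in for an OLTP extract.
# Field names (customer_id, channel, paid, ts) are illustrative assumptions.
transactions = [
    {"customer_id": "C1", "channel": "online", "paid": 40.0, "ts": datetime(2017, 5, 31, 9, 0)},
    {"customer_id": "C2", "channel": "store", "paid": 25.0, "ts": datetime(2017, 5, 31, 10, 30)},
    {"customer_id": "C1", "channel": "online", "paid": 60.0, "ts": datetime(2017, 5, 28, 15, 0)},
    {"customer_id": "C3", "channel": "online", "paid": 20.0, "ts": datetime(2017, 5, 10, 11, 0)},
    {"customer_id": "C2", "channel": "store", "paid": 15.0, "ts": datetime(2017, 3, 1, 14, 0)},
]
now = datetime(2017, 5, 31, 12, 0)  # assumed reporting time


def select_window(txs, now, window):
    """Selection: keep transactions with a timestamp in (now - window, now]."""
    return [t for t in txs if now - window < t["ts"] <= now]


def total_revenue_last_24h(txs, now):
    """Aggregation: sum of paid amounts over the last 24 hours."""
    return sum(t["paid"] for t in select_window(txs, now, timedelta(hours=24)))


def average_online_paid_last_7d(txs, now):
    """Aggregation: average paid amount of online transactions over the last seven days."""
    online = [t["paid"] for t in select_window(txs, now, timedelta(days=7))
              if t["channel"] == "online"]
    return sum(online) / len(online) if online else 0.0


def fraction_returning_customers(txs, now, days=30):
    """Share of distinct customers with more than one transaction in the window."""
    counts = {}
    for t in select_window(txs, now, timedelta(days=days)):
        counts[t["customer_id"]] = counts.get(t["customer_id"], 0) + 1
    if not counts:
        return 0.0
    returning = sum(1 for n in counts.values() if n > 1)
    return returning / len(counts)
```

Each indicator is one selection step (a time window, optionally a channel filter) followed by one aggregation (sum, average, or count), which is the pattern the example describes. In a production setting these would more likely be SQL queries against the OLTP system or a data warehouse.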


In its broadest sense, analytics strongly overlaps with data science,statistics, and related fields such as artificial intelligence (AI) andmachine learning. Analytics, to us, is a toolbox containing a varietyof instruments and methodologies allowing users to analyze data for adiverse range of well-specified purposes. Table 1.1 identifies a numberof categories of analytical tools that cover diverse intended uses or, inother words, allow users to complete a diverse range of tasks.

A first main group of tasks identified in Table 1.1 concerns pre-diction. Based on observed variables, the aim is to accurately estimateor predict an unobserved value. The applicable subtype of predictiveanalytics depends on the type of target variable, which we intend tomodel as a function of a set of predictor variables. When the targetvariable is categorical in nature, meaning the variable can only takea limited number of possible values (e.g., churner or not, fraudster ornot, defaulter or not), then we have a classification problem. Whenthe task concerns the estimation of a continuous target variable (e.g.,sales amount, customer lifetime value, credit loss), which can takeany value over a certain range of possible values, we are dealingwith regression. Survival analysis and forecasting explicitly accountfor the time dimension by either predicting the timing of events(e.g., churn, fraud, default) or the evolution of a target variable intime (e.g., churn rates, fraud rates, default rates). Table 1.2 providessimplified example datasets and analytical models for each type ofpredictive analytics for illustrative purposes.

The second main group of analytics comprises descriptive analytics that, rather than predicting a target variable, aim at identifying specific types of patterns. Clustering or segmentation aims at grouping entities (e.g., customers, transactions, employees, etc.) that are similar in nature. The objective of association analysis is to find groups of events that frequently co-occur and therefore appear to be associated. The basic observations that are being analyzed in this problem setting consist of variable groups of events; for instance, transactions involving various products that are being bought by a customer at a certain moment in time. The aim of sequence analysis

Table 1.1 Categories of Analytics from a Task-Oriented Perspective

Predictive Analytics    Descriptive Analytics
Classification          Clustering
Regression              Association analysis
Survival analysis       Sequence analysis
Forecasting

Table 1.2 Example Datasets and Predictive Analytical Models

Example dataset Predictive analytical model

Classification

ID    Recency  Frequency  Monetary  Churn
C1    26       4.2        126       Yes
C2    37       2.1        59        No
C3    2        8.5        256       No
C4    18       6.2        89        No
C5    46       1.1        37        Yes
...   ...      ...        ...       ...

Decision tree classification model:

Frequency < 5?
    Yes: Recency > 25?
        Yes: Churn = Yes
        No:  Churn = No
    No:  Churn = No

Regression

ID    Recency  Frequency  Monetary  CLV
C1    26       4.2        126       3,817
C2    37       2.1        59        4,31
C3    2        8.5        256       2,187
C4    18       6.2        89        543
C5    46       1.1        37        1,548
...   ...      ...        ...       ...

Linear regression model:

$\mathrm{CLV} = 260 + 11 \cdot \mathrm{Recency} + 6.1 \cdot \mathrm{Frequency} + 3.4 \cdot \mathrm{Monetary}$

Survival analysis

ID    Recency  Churn or Censored  Time of Churn or Censoring
C1    26       Churn              181
C2    37       Censored           253
C3    2        Censored           37
C4    18       Censored           172
C5    46       Churn              98
...   ...      ...                ...

General parametric survival analysis model:

$\log(T) = 13 + 5.3 \cdot \mathrm{Recency}$

Forecasting

Timestamp  Demand
January    513
February   652
March      435
April      578
May        601
...        ...

Weighted moving average forecasting model:

$\mathrm{Demand}_t = 0.4 \cdot \mathrm{Demand}_{t-1} + 0.3 \cdot \mathrm{Demand}_{t-2} + 0.2 \cdot \mathrm{Demand}_{t-3} + 0.1 \cdot \mathrm{Demand}_{t-4}$
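Three of the models in Table 1.2 are simple enough to evaluate by hand. As a minimal sketch, the Python functions below encode the decision tree, the linear regression, and the weighted moving average with the values printed in the table. The function names and the data layout are illustrative assumptions and do not come from the text, and the branch order of the tree follows what appears to be the most natural reading of the figure.

```python
# Illustrative sketch of three models from Table 1.2.
# Function names and data layout are assumptions made for this example.


def predict_churn(recency, frequency):
    """Decision tree: predict churn only if Frequency < 5 and Recency > 25."""
    if frequency < 5:
        return "Yes" if recency > 25 else "No"
    return "No"


def predict_clv(recency, frequency, monetary):
    """Linear regression with the coefficients printed in Table 1.2."""
    return 260 + 11 * recency + 6.1 * frequency + 3.4 * monetary


def forecast_demand(history, weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted moving average; weights[0] applies to the latest observation."""
    if len(history) < len(weights):
        raise ValueError("need at least as many observations as weights")
    recent_first = history[::-1][:len(weights)]
    return sum(w * d for w, d in zip(weights, recent_first))


# (ID, Recency, Frequency, Monetary) rows copied from Table 1.2
customers = [
    ("C1", 26, 4.2, 126),
    ("C2", 37, 2.1, 59),
    ("C3", 2, 8.5, 256),
    ("C4", 18, 6.2, 89),
    ("C5", 46, 1.1, 37),
]
demand = [513, 652, 435, 578, 601]  # January through May

churn_predictions = {cid: predict_churn(r, f) for cid, r, f, _ in customers}
clv_c1 = predict_clv(26, 4.2, 126)
june_forecast = forecast_demand(demand)

print(churn_predictions)
print(round(clv_c1, 2))
print(round(june_forecast, 1))
```

With the five months shown, the forecast for the next month works out to 0.4 ⋅ 601 + 0.3 ⋅ 578 + 0.2 ⋅ 435 + 0.1 ⋅ 652 = 566. Under this reading of the tree, customer C2 is predicted to churn although the example dataset records "No", and the regression does not reproduce the listed CLV values; both are in line with the table presenting simplified models for illustrative purposes only.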

Table 1.3 Example Datasets and Descriptive Analytical Models

Data Descriptive analytical model

Clustering

ID    Recency  Frequency
C1    26       4.2
C2    37       2.1
C3    2        8.5
C4    18       6.2
C5    46       1.1
...   ...      ...

K-means clustering with K = 3:

[Figure: plot of Frequency (vertical axis, 0 to 10) against Recency (horizontal axis, 0 to 60)]

Association analysis

ID    Items
T1    beer, pizza, diapers, baby food
T2    coke, beer, diapers
T3    crisps, diapers, baby food
T4    chocolates, diapers, pizza, apples
T5    tomatoes, water, oranges, beer
...   ...

Association rules:

If baby food And diapers Then beer
If coke And pizza Then crisps
...

Sequence analysis

ID    Sequential items
C1    <{3},{9}>
C2    <{1 2},{3},{4 6 7}>
C3    <{3 5 7}>
C4    <{3},{4 7},{9}>
C5    <{9}>
...   ...

Sequence rules:

[Figure: sequence rules involving Item 3, Item 9, Item 4 & 7, and Item 31]

is similar to association analysis but concerns the detection of events that frequently occur sequentially, rather than simultaneously as in association analysis. As such, sequence analysis explicitly accounts for the time dimension. Table 1.3 provides simplified examples of datasets and analytical models for each type of descriptive analytics.
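To make the idea of events that "frequently co-occur" concrete, the sketch below evaluates the first association rule of Table 1.3 on the five transactions shown. It uses support and confidence, two measures that are standard in association analysis but are not defined in the text above, so they should be read as an assumption of this example; the function names are likewise hypothetical.

```python
# Support and confidence for an association rule from Table 1.3.
# The two measures and all names below are assumptions of this sketch.

transactions = {
    "T1": {"beer", "pizza", "diapers", "baby food"},
    "T2": {"coke", "beer", "diapers"},
    "T3": {"crisps", "diapers", "baby food"},
    "T4": {"chocolates", "diapers", "pizza", "apples"},
    "T5": {"tomatoes", "water", "oranges", "beer"},
}


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Support of the whole rule divided by support of the antecedent."""
    antecedent_support = support(antecedent, baskets)
    if antecedent_support == 0:
        return None  # the antecedent never occurs, so the rule is undefined
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / antecedent_support


# Rule: If baby food And diapers Then beer
rule_support = support({"baby food", "diapers", "beer"}, transactions)
rule_confidence = confidence({"baby food", "diapers"}, {"beer"}, transactions)

# Rule: If coke And pizza Then crisps
second_rule_confidence = confidence({"coke", "pizza"}, {"crisps"}, transactions)

print(rule_support, rule_confidence, second_rule_confidence)
```

On these five rows, baby food and diapers appear together in T1 and T3, and only T1 also contains beer, so the rule holds in one of five transactions and in half of the cases where its antecedent applies. None of the five listed transactions contains both coke and pizza, so the second rule can only be assessed on the rows the table elides.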

Note that Tables 1.1 through 1.3 identify and illustrate categories of approaches that are able to complete a specific task from a technical rather than an applied perspective. These different types of analytics can be applied in quite diverse business and nonbusiness settings and consequently lead to many specialized applications. For instance, predictive analytics and, more specifically, classification techniques may be applied for detecting fraudulent credit-card transactions, for predicting customer churn, for assessing loan applications, and so forth. From an application perspective, this leads to various groups of analytics such as, respectively, fraud analytics, customer or marketing analytics, and credit risk analytics. A wide range of business applications of analytics across industries and business departments is discussed in detail in Chapter 3.

With respect to Table 1.1, it needs to be noted that these different types of analytics apply to structured data. An example of a structured dataset is shown in Table 1.4. The rows in such a dataset are typically called observations, instances, records, or lines, and represent or collect information on basic entities such as customers, transactions, accounts, or citizens. The columns are typically referred to as (explanatory or predictor) variables, characteristics, attributes, predictors, inputs, dimensions, effects, or features. The columns contain information on a particular entity as represented by a row in the table. In Table 1.4, the second column represents the age of a customer, the third column the income, and so on. In this book we consistently use the terms observation and variable (and sometimes more specifically, explanatory, predictor, or target variable).
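The row and column terminology can be illustrated with a small sketch that holds the data of Table 1.4 in Python. Each object is one observation and each field is one variable. The class name, the field names, and the choice to treat the customer name as an identifier and Churn as the target variable are assumptions made for this example.

```python
from dataclasses import dataclass, fields


@dataclass
class Customer:
    """One observation (row) of Table 1.4; each field is one variable."""
    customer: str   # identifier of the entity (assumed not to be a predictor)
    age: int
    income: int
    gender: str
    duration: int
    churn: str      # assumed target variable


# Rows copied from Table 1.4
observations = [
    Customer("John", 30, 1800, "Male", 620, "Yes"),
    Customer("Sarah", 25, 1400, "Female", 12, "No"),
    Customer("Sophie", 52, 2600, "Female", 830, "No"),
    Customer("David", 42, 2200, "Male", 90, "Yes"),
]

# Columns of the table, in order
variables = [f.name for f in fields(Customer)]

# Split the variables into predictor variables and the target variable
target = "churn"
predictors = [v for v in variables if v not in ("customer", target)]

# Reading one column means collecting the same variable across observations
ages = [obs.age for obs in observations]

print(variables)
print(predictors)
print(ages)
```

Because every observation has the same fields with a fixed meaning, selecting a column or separating predictors from the target is a one-line operation, which is the practical sense in which structured data is easier to analyze.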

Because of the structure that is present in the dataset in Table 1.4 and the well-defined meaning of rows and columns, it is much easier to analyze such a structured dataset compared to analyzing unstructured data such as text, video, or networks, to name a few. Specialized techniques exist that facilitate analysis of unstructured data: for instance, text analytics with applications such as sentiment analysis, video analytics that can be applied for face recognition and incident detection, and network analytics with applications such as community

Table 1.4 Structured Dataset

Customer  Age  Income  Gender  Duration  Churn
John      30   1,800   Male    620       Yes
Sarah     25   1,400   Female  12        No
Sophie    52   2,600   Female  830       No
David     42   2,200   Male    90        Yes

mining and relational learning (see Chapter 2). Given the rough estimate that over 90% of all data are unstructured, clearly there is a large potential for these types of analytics to be applied in business.

However, because of the inherent complexity of analyzing unstructured data, and because the often-significant development costs only appear to pay off in settings where these techniques add significantly to the easier-to-apply structured analytics, we currently see relatively few business applications being developed and implemented. In this book, we therefore focus on analytics for analyzing structured data, and more specifically the subset listed in Table 1.1. For unstructured analytics, one may refer to the specialized literature (Elder IV and Thomas 2012; Chakraborty, Murali, and Satish 2013; Coussement 2014; Verbeke, Martens, and Baesens 2014; Baesens, Van Vlasselaer, and Verbeke 2015).

PROFIT-DRIVEN BUSINESS ANALYTICS

The premise of this book is that analytics is to be adopted in business for better decision making, with “better” meaning optimal in terms of maximizing the net profits, returns, payoff, or value resulting from the decisions that are made based on insights obtained from data by applying analytics. The incurred returns may stem from a gain in efficiency, lower costs or losses, and additional sales, among others. The decision level at which analytics is typically adopted is the operational level, where many customized decisions are to be made that are similar and granular in nature. High-level, ad hoc decision making at strategic and tactical levels in organizations may also benefit from analytics, but presumably to a much lesser extent.

The decisions involved in developing a business strategy are highly complex in nature and do not match the elementary tasks listed in Table 1.1. A higher-level AI would be required for such a purpose, which is not yet at our disposal. At the operational level, however, there are many simple decisions to be made that exactly match the tasks listed in Table 1.1. This is not surprising, since these approaches have often been developed with a specific application in mind. In Table 1.5, we provide a selection of example applications, most of which will be elaborated on in detail in Chapter 3.

Analytics facilitates optimization of the fine-grained decision-making activities listed in Table 1.5, leading to lower costs or losses and higher revenues and profits. The level of optimization depends on the accuracy and validity of the predictions, estimates, or patterns derived from the data. Additionally, as we stress in this book, the quality

Table 1.5 Examples of Business Decisions Matching Analytics

Decision Making with Predictive Analytics

Classification

Credit officers have to screen loan applications and decide on whether to accept or reject an application based on the involved risk. Based on historical data on the performance of past loan applications, a classification model may learn to distinguish good from bad loan applications using a number of well-chosen characteristics of the application as well as of the applicant. Analytics and, more specifically, classification techniques allow us to optimize the loan-granting process by more accurately assessing risk and reducing bad loan losses (Van Gestel and Baesens 2009; Verbraken et al. 2014). Similar applications of decision making based on classification techniques, which are discussed in more detail in Chapter 3 of this book, include customer churn prediction, response modeling, and fraud detection.

Regression

Regression models allow us to estimate a continuous target value and in practice are being adopted, for instance, to estimate customer lifetime value. Having an indication on the future worth in terms of revenues or profits a customer will generate is important to allow customization of marketing efforts, for pricing, etc. As is discussed in detail in Chapter 3, analyzing historical customer data allows estimating the future net value of current customers using a regression model.

Similar applications involve loss given default modeling, as is discussed in Chapter 3, as well as the estimation of software development costs (Dejaeger et al. 2012).

Survival analysis

Survival analysis is being adopted in predictive maintenance applications for estimating when a machine component will fail. Such knowledge allows us to optimize decisions related to machine maintenance, for instance, to optimally plan when to replace a vital component. This decision requires striking a balance between the cost of machine failure during operations and the cost of the component, which is preferred to be operated as long as possible before replacing it (Widodo and Yang 2011).

Alternative business applications of survival analysis involve the prediction of time to churn and time to default where, compared to classification, the focus is on predicting when the event will occur rather than whether the event will occur.

Forecasting

A typical application of forecasting involves demand forecasting, which allows us to optimize production planning and supply chain management decisions. For instance, a power supplier needs to be able to balance electricity production and demand by the consumers and for this purpose adopts forecasting or time-series modeling techniques. These approaches allow an accurate prediction of the short-term evolution of demand based on historical demand patterns (Hyndman et al. 2008).