Linköping University | Department of Computer and Information Science
Master’s thesis, 30 ECTS | Statistics and Machine Learning
2020 | LIU-IDA/STAT-A--20/003--SE

Omnichannel path to purchase: Viability of Bayesian Network as Market Attribution Models

Anubhav Dikshit

Supervisor: Hao Chi Kiang
Examiner: Anders Nordgaard
External supervisors: Viking Fristrom and Anna Lundmark

Linköpings universitet, SE–581 83 Linköping, +46 13 28 10 00, www.liu.se





Copyright

    The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

    © Anubhav Dikshit


  • Abstract

Market attribution is the problem of interpreting the influence of advertisements on the user’s decision process. Market attribution is a hard problem, and solving it is a significant source of Google’s revenue. There are broadly two types of attribution models: data-driven and heuristic. This thesis focuses on data-driven attribution models, explores the viability of using Bayesian Networks as market attribution models, and benchmarks their performance against a logistic regression. The data used in this thesis was preprocessed using an undersampling technique. Furthermore, multiple techniques and algorithms to learn and train Bayesian Networks are explored and evaluated.

    For the given dataset, it was found that a Bayesian Network can be used for market attribution modeling and that its performance is better than that of the baseline logistic model.

    Keywords: Market Attribution Model, Bayesian Network, Logistic Regression.

  • Acknowledgments

I would like to thank everyone at my university for making my time during my master’s a pleasant experience. There will always be a little bit of LiU in me wherever I go. I would further like to thank Fanny, Anna, and Viking at Nepa for providing me the opportunity to experience Nepa and meet so many wonderful people.

    I would like to thank my supervisor Hao Chi for all the feedback and motivation that he provided. Our conversations have always been profound and thought-provoking. I would also like to thank Yusur Almutair, my opponent, who has been instrumental in ensuring that this thesis conforms to academic standards.

    Finally, I would like to thank my family for all the support and encouragement throughout the master’s program, especially my wife, Chetana Deshpande.


  • Contents

    Abstract iii

    Acknowledgments iv

    Contents v

    List of Figures vii

    List of Tables viii

1 Introduction 1
        1.1 Nepa 1
        1.2 Background 1
        1.3 Objective 2

    2 Data 3
        2.1 Overview 3
        2.2 Data summary 3
        2.3 Data Pre-Processing 4
            2.3.1 Imbalanced data 4
            2.3.2 Estimation of optimal undersampling ratio 5
            2.3.3 Distribution of Data 7

    3 Theoretical Background 9
        3.1 Bayesian Networks 9
        3.2 Learning a Bayesian Network 11
            3.2.1 Structure Learning 11
            3.2.2 Parameter Learning 12
        3.3 Bayesian Network Algorithms 12
            3.3.1 Constraint-based algorithms 12
            3.3.2 Score-based algorithms 13
            3.3.3 Hybrid algorithms 14
        3.4 Conditional independence tests 16
        3.5 Inference on Bayesian Network 17
            3.5.1 Exact Inference 17
            3.5.2 Approximate Inference 18
        3.6 Logistic Regression 18
            3.6.1 Log-likelihood 18
            3.6.2 The Hessian 19
        3.7 Metrics for Evaluation of Model Performance 19
            3.7.1 Bayesian Information Criterion (BIC) 19
            3.7.2 F1 score 20
            3.7.3 Accuracy 20


            3.7.4 Positive Class Accuracy 20
            3.7.5 Balanced Accuracy 20
        3.8 Techniques for Measuring Association 20
            3.8.1 Chi-Square Test (χ2 test) 21
            3.8.2 Brute Force Search 21
            3.8.3 Multiple Correspondence Analysis 21

    4 Methods 24
        4.1 Baseline Logistic Regression Model 24
            4.1.1 Optimal Decision Threshold of Logistic Regression 24
        4.2 Bayesian Network Model 25
            4.2.1 White-list Creation Using Hypothesis Testing 26
            4.2.2 White-list Creation Using Multiple Correspondence Analysis 26
            4.2.3 White-list Creation Using Grid Search 28
            4.2.4 Black-list Creation Using Domain Knowledge 29
            4.2.5 Optimal Restart Using Cross-Validation 30
        4.3 Bayesian Network Model Comparison 30
        4.4 Attribution Formula 30

    5 Results 31
        5.1 Baseline Logistic Regression Model 31
        5.2 Bayesian Network Models 31
        5.3 Distribution of Attribution 32
        5.4 Comparison of Attribution 34

    6 Discussion 36
        6.1 Results 36
        6.2 Method 37
            6.2.1 Undersampling 37
            6.2.2 Bayesian Network Modeling 38
            6.2.3 Attribution Formula 38
        6.3 Ethical Considerations 39
        6.4 Future Work 39
        6.5 Delimitations 39

    7 Conclusion 40

    Bibliography 41

    A Appendix 45


  • List of Figures

2.1 Visual depiction of random undersampling, adapted from [undersampling_Example_source] 5
    2.2 Plots of performance metrics vs. undersampling ratio, optimal undersampling ratio 1.4 6
    2.3 Plot to check overfitting by comparing train and validation data-set balanced accuracy 7

3.1 DAG mapping probability, source from [Graduate_Teaching] 10
    3.2 DAG and CPDAG, adapted from [Graduate_Teaching] 10
    3.3 MCA notation, source [MCA_slide] 22
    3.4 Point cloud of categories/columns, source [MCA_slide] 22

4.1 Plot showing the model metrics vs. the decision boundary cutoff 25
    4.2 Testing for over-fitting at different decision boundary cutoffs 25
    4.3 Plot showing the Variance Explained vs. Eigenvectors 27
    4.4 Plot showing the contribution of variables towards Eigenvector 1 27
    4.5 Plot showing the contribution of variables towards Eigenvector 2 28
    4.6 BIC vs. Iteration using Grid Search 29
    4.7 Balanced Accuracy vs. Iteration using Grid Search 29
    4.8 Balanced Accuracy vs. Random Restarts 30

5.1 Plot of Bayesian Network model with the highest accuracy, built using Hill climbing algorithm with white-list using Chi-square test 32

5.2 Distribution of attribution values for variables from Bayesian Network Model 33
    5.3 Distribution of attribution values for variables from Logistic Model 34

A.1 Plot of Model using MMHC algorithm with white-list using Chi-square test 47
    A.2 Plot of Model using Tabu Search algorithm with white-list using Chi-square test 48
    A.3 Plot of Model using RSMAX2 algorithm with white-list using Chi-square test 49
    A.4 Plot of Model using RSMAX2 algorithm with white-list using Grid Search 50
    A.5 Plot of Model using Hill climbing algorithm with white-list using Grid Search 51
    A.6 Plot of Model using MMHC algorithm with white-list using Grid Search 52
    A.7 Plot of Model using Tabu Search algorithm with white-list using Grid Search 53
    A.8 Plot of Model using RSMAX2 algorithm with white-list using MCA 54
    A.9 Plot of Model using MMHC algorithm with white-list using MCA 55
    A.10 Plot of Model using Hill climbing algorithm with white-list using MCA 56
    A.11 Plot of Model using Tabu Search algorithm with white-list using MCA 57


  • List of Tables

2.1 Sample data 4
    2.2 Variable description 4
    2.3 Proportion of conv==0 by column value 8

    3.1 Confusion matrix for a binary class problem . . . . . . . . . . . . . . . . . . . . . . 20

    4.1 Hypothesis Table with P-values, tested at 95% significance . . . . . . . . . . . . . . 26

5.1 Performance Metrics for Logistic Model 31
    5.2 Performance of Models using different algorithms and white-list techniques 31
    5.3 Mean attribution values of variables for Bayesian Network Model and Logistic Model 35

A.1 Logistic Model Summary 45
    A.2 95% Confidence interval of attribution values 46


  • 1 Introduction

This chapter provides a brief introduction to the concepts of ’Market Attribution’ and ’Path to Purchase,’ as well as to the company and the purpose of the thesis project undertaken.

    1.1 Nepa

This thesis was done in collaboration with Nepa AB. Nepa is a Global Consumer Science marketing company based out of Stockholm, Sweden, with offices in Helsinki, Oslo, Copenhagen, London, and Mumbai.

    1.2 Background

Shao and Li [1] define ’Market Attribution’ as the problem of interpreting the influence of advertisements on the user’s decision process. The goal of attribution modeling is to pinpoint the credit assignment of each positive user (a customer who made a purchase or clicked on an ad) to one or more advertising touchpoints.

In 1898, E. St. Elmo Lewis developed a theoretical model to help understand the customer journey/purchase funnel. This model is called the ’AIDA’ model [2]. The stages proposed by this model are:

    1. Awareness – the customer is aware of the existence of a product or service.

    2. Interest – actively expressing an interest in a product group.

    3. Desire – aspiring to a particular brand or product.

    4. Action – taking the next step towards purchasing the chosen product.

Although the ’AIDA’ model has its flaws, such as assuming a linear step-by-step process behind the purchase decision, as mentioned in [3], the ’AIDA’ model forms the basis for more recent approaches. Today a customer can interact with multiple touchpoints throughout the purchase journey, and many of these customers research on one device while purchasing through another [4]. Thus a unified approach is needed to establish the


customer purchase journey while navigating through multiple touchpoints. Path-to-purchase analysis refers to the analysis of the sequence of channels (touchpoints) that customers were exposed to throughout the ’purchase funnel.’

Previous methods for ’Market Attribution’ include heuristic approaches such as the ’Last Click Rule’ and the ’First Click Rule,’ which assign 100% of the credit to the last or the first interaction/touchpoint (e.g., the last or first advertisement source), respectively. These are highly flawed models, as pointed out by [5], since they ignore the whole journey of the customer/transaction. Thus, data-driven techniques were deployed to perform market attribution. These models use statistical and machine learning techniques.
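The ’Last Click’ and ’First Click’ heuristics are simple enough to sketch directly. The following is a minimal illustrative Python snippet (the thesis work itself was done in R; the path data and function name here are hypothetical):

```python
from collections import Counter

def heuristic_attribution(paths, rule="last"):
    """Assign 100% of each converting path's credit to a single touchpoint,
    following the 'Last Click' (rule='last') or 'First Click' (rule='first') heuristic."""
    credit = Counter()
    for touchpoints, converted in paths:
        if not converted or not touchpoints:
            continue  # heuristics only credit converting journeys
        chosen = touchpoints[-1] if rule == "last" else touchpoints[0]
        credit[chosen] += 1
    total = sum(credit.values())
    # Normalise so the credits sum to 1 across channels.
    return {channel: n / total for channel, n in credit.items()}

# Hypothetical journeys: (sequence of touchpoints, converted?)
paths = [
    (["tv", "social", "search"], True),
    (["search"], True),
    (["tv", "outdoor"], False),
]
print(heuristic_attribution(paths, rule="last"))   # → {'search': 1.0}
```

This illustrates the flaw the text points out: under the last-click rule, ’search’ receives all the credit even though ’tv’ and ’social’ appeared earlier in a converting journey.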

Shao and Li were the pioneers of attribution using data-driven methods [1]. They chose a ’Logistic Regression model with bagging’ to reduce estimation variability, and a probabilistic model to compute the probability of conversion by each channel/touchpoint. ’Game Theory’ based models were explored in [6]. This paper viewed the influence of variables from a causal framework, and its attribution formula has been adopted in the current thesis. Models with ’carryover’ and ’spillover’ effects were explored in [7], where each channel/touchpoint was not viewed as acting linearly but was partly affected by the other channels/touchpoints influencing the customer’s decision. ’Mutually Exciting Point Process’ models were explored in [8], where the effects of a channel/touchpoint were modeled as random effects with some interaction between past events, thus accounting for a time effect.

The use of ’Hidden Markov Models (HMM)’ was explored in [9], where, along with the HMM, the concept of a conversion/purchase funnel was used. Econometric models were explored in [10], where attribution was based on return on investment (ROI) calculations performed using time series analysis. ’Directed Markov’ graph models for attribution were explored in [11]. Under these models, the present action was assumed to depend on the last k actions.

Apart from the literature review papers, the master’s thesis by Neville titled "Channel attribution modelling using clickstream data from an online store" [12] has been a reference for the current thesis. Neville compared the ’Last Click Rule’ with a logistic regression-based market attribution model; the confidence interval for the attribution was computed using bootstrapping.

For this thesis, the choice of modeling technique was primarily driven by personal interest and has been narrowed to ’Bayesian Networks.’ Another reason that serves as validation of this choice of model is that Bayesian Networks broadly satisfy the three properties (’Data-driven,’ ’Fairness,’ ’Interpretability’) proposed by [6]. The paper "A Bayesian Network Model of the Consumer Complaint Process" [13] gave assurance of the viability of ’Bayesian Networks’ in an industry setting.

    1.3 Objective

    This thesis attempts to answer the following research questions:

    1. Can Bayesian Networks be a good fit for market attribution modeling?

2. Can Bayesian Networks outperform the baseline model in terms of accuracy, considering that they emphasize ’Interpretability’?


  • 2 Data

    2.1 Overview

Twice a week, about 2,000 people (paid by Nepa) participate in an online survey. The survey is conducted multiple times (6-12 times); the reason for multiple rounds is that the average respondent answers one survey per week. The respondents are selected through an initial screening process that identifies people interested in the product the survey is about. The survey consists of questions about the different sources of advertisement (channels/touchpoints) and about whether or not the respondent ended up purchasing the product; the primary aim of the survey is to capture the whole purchase journey or path-to-purchase.

Because the data is captured via a survey, there is an inherent lack of ’ground truth,’ i.e., there is no guarantee that a customer who reports a purchase actually made one and remembers it correctly, and vice versa. However, it is generally believed that the admission of something is far more trustworthy than its absence. Thus, from this data, understanding the interactions between the sources of advertisement and the psychology of the respondents is of greater significance than predicting a purchase.

Each respondent was given a unique ’respondentid,’ and the touchpoints mentioned by them were captured in the form of binary flags. Every respondent ends their journey either with a purchase or a non-purchase. If a respondent makes a purchase, it is flagged as a successful conversion and reflected as ’1’ in the column ’conv.’

    The sample data with touchpoints for each respondent is shown in table 2.1.

    2.2 Data summary

There are 26 variables in the dataset used in the project; however, only 24 of them were used as model inputs, since ’respondentid’ is just an identifier and ’conv’ is the flag indicating conversion. Table 2.2 provides a brief description of the variables.


    Table 2.1: Sample data

respondentid     conv  ads_on_tv  brand_website  social_media  instore_research

    0948279158856     0        1           0              1               0
    0948279393368     1        1           0              0               1
    0948279446624     1        0           1              1               0

    Table 2.2: Variable description

    Variable Description

respondentid: Unique ID for the respondent
    conv: Did the respondent make a purchase?
    ads_on_radio_streaming: Did the respondent hear an advertisement on the radio?
    ads_on_tv: Did the respondent view an advertisement on the TV?
    banner_ads_online_not_social: Did the respondent view a banner advertisement on a website?
    brand_website: Did the respondent visit the website of the brand conducting the survey?
    friend_family_recommendation: Was the respondent recommended the product by family or friends?
    i_saw_an_offer_promotion: Did the respondent encounter a promotional offer?
    i_saw_something_new: Did the respondent witness a new product?
    instore_research: Did the respondent inquire about a product while at the store (talking to staff)?
    magazine_or_newspaper_ads: Did the respondent see an advertisement in a magazine or newspaper?
    online_retailer_research: Did the respondent research a product by visiting an online retailer?
    online_retailer_visit: Did the respondent visit any online retailer sites (e.g., Amazon)?
    online_video_ad: Did the respondent see a video advertisement while surfing?
    outdoor_ads: Did the respondent view a banner advertisement outside somewhere?
    previous_shopping_list: Was the product part of the shopping list?
    promo_coupon_leaflet_from_retailer: Did the respondent encounter a promotional offer from a retailer?
    promo_coupon_leaflet_not_from_retailer: Did the respondent encounter a promotional offer not from a retailer?
    recipe_site: Did the respondent visit a recipe site?
    researched_on_search_engine: Did the respondent research a product using a search engine?
    saw_a_product_display: Did the respondent see a demo or active display of the product?
    saw_a_sign_poster: Did the respondent witness a product advertisement on a poster?
    search_engine_ads: Did the respondent encounter an advertisement while using a search engine?
    social_media: Did the respondent follow a brand or hear about a product on social media?
    there_was_a_seasonal_event_or_occasional: Was there a seasonal event or occasion for the product?
    male: Was the respondent male?

    The dataset used in the project has 9,489 rows, with 2,088 unique respondents. The ratioof conversion to non-conversion is about 1:5.

    2.3 Data Pre-Processing

Having seen the data and its features, the focus now turns to data pre-processing, where the different techniques applied to the dataset are discussed. The problem of imbalanced data and various methods to tackle it are discussed in the following subsections.

    2.3.1 Imbalanced data

Imbalanced data is defined in [14] as "data with an unequal number of examples in each of its classes". The data provided by Nepa has one conversion (conv == 1) for every five non-conversions (conv == 0); thus, it is clearly imbalanced. This imbalance/rarity of conversions is expected, given that public consumption of resources cannot increase with an increase in marketing expenditure. Theoretically speaking, an imbalanced dataset is not a problem; however, as mentioned in [15], "The key point is that relative rarity/class imbalance is a problem only because learning algorithms cannot efficiently handle such data".

A dataset having few instances of one class often leads to a learning algorithm that is unable to generalize the behavior of the minority class, causing the algorithm to perform poorly in terms of predictive accuracy [16]. When the data is imbalanced, standard machine learning algorithms that maximize overall accuracy tend to classify all observations as majority-class instances [17].

When a logistic model and a Bayesian network were trained on the given dataset, both models were almost naive models (classifying everything as the majority class), thus confirming the effect of the imbalanced data on model performance. Some of the techniques to tackle the problems arising from imbalanced data are: undersampling, oversampling, boosting, and bagging.

    Figure 2.1: Visual depiction of random undersampling, adapted from [18]

Undersampling is defined in [17] as downsizing the majority class by removing observations at random until the dataset is balanced. Oversampling involves sampling the positive (minority) examples with replacement to match the number of negative (majority) examples [19]. The technique of making inference from resampling the dataset is called bootstrapping [20]. Training models on these sub-samples and averaging the predictions is termed bagging [21].

For the given dataset, the undersampling technique was used to tackle the imbalanced data problem. There are many ways to perform undersampling; one such technique is random undersampling, in which the majority class is downsized by removing observations at random until the dataset is balanced [17] (see figure 2.1). This method was chosen as it is the recommended starting point for practical applications [22].
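Random undersampling at a chosen majority-to-minority ratio can be sketched as follows. This is an illustrative Python snippet with hypothetical row dictionaries (the thesis work was done in R), not the thesis code:

```python
import random

def random_undersample(rows, ratio=1.0, label_key="conv", seed=42):
    """Random undersampling: keep every minority (conv == 1) row and a random
    subset of majority (conv == 0) rows so that #majority / #minority == ratio."""
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    n_keep = min(len(majority), round(ratio * len(minority)))
    kept = random.Random(seed).sample(majority, n_keep)  # draw without replacement
    return minority + kept

# A 1:5 class imbalance like the Nepa data, undersampled at ratio 1.4.
rows = [{"conv": 1} for _ in range(100)] + [{"conv": 0} for _ in range(500)]
balanced = random_undersample(rows, ratio=1.4)
```

With 100 conversions and ratio 1.4, the function keeps 140 majority rows, mirroring the "for every conversion there must be 1.4 non-conversions" rule chosen later in the chapter.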

Having decided to use undersampling, the problem of finding the optimal undersampling ratio (the ratio of majority-class to minority-class instances in the data) remains. In [23], it was argued that resampling to full balance (a 1:1 ratio) is not necessarily optimal for the predictive performance of the model, and that the optimal ratio differs across datasets. A logistic regression model was used to find the optimal balancing/undersampling ratio for the given dataset, and the predictive performance of the model was measured for different undersampling ratios using cross-validation.

    2.3.2 Estimation of optimal undersampling ratio

In statistical modelling and machine learning, it is common practice to split a dataset of n data points into three parts: n_train, n_validation, and n_test. The model is trained on the n_train points and tuned on the n_validation points. Finally, the model is trained on n_train + n_validation and tested on the n_test points [24].

The given dataset was split into three sets, namely a ’training,’ ’validation,’ and ’test’ set, in the ratio 60-20-20. The logistic model was trained on the ’training’ dataset, and its performance was measured on the ’validation’ dataset.
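A 60-20-20 split like the one described above can be sketched as follows (an illustrative Python snippet with hypothetical names; the thesis code was written in R):

```python
import random

def train_val_test_split(rows, fracs=(0.6, 0.2, 0.2), seed=0):
    """Shuffle, then cut the data into train/validation/test parts
    in the given proportions (60-20-20 by default)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for a reproducible split
    n_train = int(fracs[0] * len(rows))
    n_val = int(fracs[1] * len(rows))
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

train, val, test = train_val_test_split(range(1000))
```

The three parts are disjoint and together cover the whole dataset; shuffling first avoids any ordering bias in how rows were collected.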

Figure 2.2: Plots of performance metrics vs. undersampling ratio, optimal undersampling ratio 1.4

The optimal undersampling ratio was determined using metrics such as ’accuracy,’ ’balanced accuracy’ (the mean of the positive- and negative-class accuracy), ’F1 score,’ and ’positive class accuracy’ (accuracy in identifying successful conversions) on the ’validation’ data (see figure 2.2). Datasets with varying undersampling ratios were used to train a logistic model, and the performance metrics were plotted. The optimal point (red line in figure 2.2) corresponds to an undersampling ratio of 1.4 (for every conversion there must be 1.4 non-conversions).

As shown in [25], ’accuracy’ and ’F1 score’ suffer from attenuation under imbalanced distributions. In [26], a new metric called the ’Index of Balanced Accuracy’ was used, which is based on a modified version of balanced accuracy (using the geometric mean instead of the arithmetic mean). ’Balanced accuracy’ was chosen as the metric to measure the performance of the logistic model, since it gives equal importance to classifying both classes (’conversion’ and ’non-conversion’) and it is readily available in popular R packages such as ’caret’ [27]. Thus, the undersampling ratio at which the model obtained the maximum balanced accuracy was chosen.
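The metrics named above can all be computed from the four confusion-matrix counts. The following is an illustrative Python sketch of those definitions (the thesis itself relied on the R package ’caret’; the function name here is hypothetical):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, positive-class accuracy (recall), balanced accuracy and F1 score
    for a binary problem; balanced accuracy is the arithmetic mean of the
    positive- and negative-class accuracies, as used in the thesis."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    pos_acc = tp / (tp + fn) if tp + fn else 0.0
    neg_acc = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * pos_acc / (precision + pos_acc) if precision + pos_acc else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "positive_class_accuracy": pos_acc,
        "balanced_accuracy": (pos_acc + neg_acc) / 2,
        "f1": f1,
    }

m = classification_metrics([1, 1, 0, 0], [1, 0, 0, 0])
```

On this tiny example, one of two conversions is found (positive-class accuracy 0.5) while both non-conversions are correct, so balanced accuracy is (0.5 + 1.0) / 2 = 0.75.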


Figure 2.3: Plot to check overfitting by comparing train and validation data-set balanced accuracy

In the endeavor to train a model with good predictive performance, it is a good idea to check for ’overfitting.’ A model that is more flexible than it needs to be, such that it yields a small training error but a significant test error, is said to be ’overfitting’ the data [28]. One can check for this issue by comparing the predictive performance on the ’training’ and ’validation’ datasets (see figure 2.3). From the figure, one can see that the balanced accuracy on ’training’ and ’validation’ follows a similar trend, with no significant deviation between them; thus there is no ’overfitting’ on the data.

    2.3.3 Distribution of Data

The data provided by Nepa consists of 9,489 rows and 26 columns, with 1,745 conversions (conv == 1); after performing the random undersampling, the data consisted of 4,188 rows. The proportion of non-conversions (conv == 0) against each variable, as shown in table 2.3, is used to check that the distribution of the data has not changed significantly post undersampling.

For each variable, the ratio

        (instances of variable == 1) / (instances of conv == 0)

    is compared before and after undersampling, since the underlying proportions, and thus the distribution of the data, must not change drastically.

    E.g., for the variable ads_on_tv:

        ratio = 0.20 before undersampling
        ratio = 0.15 post undersampling
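This proportion check can be sketched in a few lines. The snippet below is illustrative Python with a hypothetical row format mirroring the sample data (the thesis computations were done in R):

```python
def share_given_nonconv(rows, var):
    """Among the conv == 0 rows, the share having `var` == 1 (the quantity
    tabulated before and after undersampling in Table 2.3)."""
    nonconv = [r for r in rows if r["conv"] == 0]
    return sum(r[var] for r in nonconv) / len(nonconv)

# Hypothetical pre-undersampling data: 20% of non-converters saw a TV ad.
pre = ([{"conv": 0, "ads_on_tv": 1}] * 20 +
       [{"conv": 0, "ads_on_tv": 0}] * 80)
share = share_given_nonconv(pre, "ads_on_tv")
```

Running the same function on the undersampled rows and comparing the two shares, variable by variable, reproduces the check behind Table 2.3.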


    Table 2.3: Proportion of conv==0 by column value

Variable                                   Pre Undersampling    Post Undersampling
                                               Var=0    Var=1       Var=0    Var=1

    ads_on_radio_streaming                      0.98     0.02        0.98     0.02
    ads_on_tv                                   0.80     0.20        0.85     0.15
    banner_ads_online_not_social                0.96     0.04        0.92     0.08
    brand_website                               0.97     0.03        0.99     0.01
    friend_family_recommendation                0.91     0.09        0.95     0.05
    i_saw_an_offer_promotion                    0.63     0.37        0.83     0.17
    i_saw_something_new                         0.92     0.08        0.95     0.05
    instore_research                            0.98     0.02        0.99     0.01
    magazine_or_newspaper_ads                   0.95     0.05        0.96     0.04
    online_retailer_research                    0.97     0.03        0.98     0.02
    online_retailer_visit                       0.94     0.06        0.95     0.05
    online_video_ad                             0.94     0.06        0.96     0.04
    outdoor_ads                                 0.95     0.05        0.95     0.05
    previous_shopping_list                      0.98     0.02        0.98     0.02
    promo_coupon_leaflet_from_retailer          0.89     0.11        0.91     0.09
    promo_coupon_leaflet_not_from_retailer      0.95     0.05        0.96     0.04
    recipe_site                                 0.99     0.01        0.99     0.01
    researched_on_search_engine                 0.97     0.03        0.98     0.02
    saw_a_product_display                       0.72     0.28        0.81     0.19
    saw_a_sign_poster                           0.92     0.08        0.95     0.05
    search_engine_ads                           0.96     0.04        0.98     0.02
    social_media                                0.96     0.04        0.96     0.04
    there_was_a_seasonal_event_or_occasion      0.99     0.01        1.00     0.00
    male                                        0.60     0.40        0.58     0.42


3 Theoretical Background

This chapter presents the mathematical and statistical background of the various techniques used in this thesis: starting with Bayesian networks, their design and working are explained, followed by the choice of algorithms and techniques.

    3.1 Bayesian Networks

Before diving into Bayesian networks, one needs to understand some fundamental building blocks of these networks.

Directed acyclic graphs (DAG): A DAG is a finite, directed graph with no directed cycles. A directed graph G = (V, E) consists of two sets: a finite set V of elements called vertices and a finite set E of elements called edges [29]. Directed acyclic graphs have the following properties: arcs can only be directed, the network must not contain any loop, and the network must not contain any cycle (a path whose first edge starts at the same vertex where its last edge ends).

Parent and child node: An arc is a directed link between two nodes (random variables), usually assumed to be distinct. Nodes linked by an arc have a direct parent-child relationship: the node at the tail of the arc is the parent node, and the node at the head of the arc is the child node [30].

Skeleton: An undirected graph can always be constructed from a directed or partially directed one by substituting all the directed arcs with undirected ones; such a graph is called the skeleton, or the underlying undirected graph, of the original graph [30].

V-structures: Two nodes being parents of another node such that the two parents are not linked [30].
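The definitions above (skeleton and v-structures) can be sketched directly on a list of arcs. This is an illustrative helper, not part of the thesis code; arcs are hypothetical (parent, child) pairs.

```python
def skeleton(arcs):
    """Underlying undirected graph: each directed arc becomes an unordered pair."""
    return {frozenset(a) for a in arcs}

def v_structures(arcs):
    """Find triples A -> C <- B where the two parents A, B share no arc."""
    parents = {}
    for p, c in arcs:
        parents.setdefault(c, set()).add(p)
    linked = skeleton(arcs)
    out = set()
    for child, ps in parents.items():
        ps = sorted(ps)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                # unlinked parent pair sharing a child => v-structure
                if frozenset((ps[i], ps[j])) not in linked:
                    out.add((ps[i], child, ps[j]))
    return out
```

For the arcs A→C, B→C, C→D this yields the single v-structure A→C←B, since A and B are not adjacent.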


    Figure 3.1: DAG mapping probability, source from [31]

A DAG can encode the probability distribution of X; formally, the DAG is called an 'independence map' of the probability distribution of the random variables X, with graphical separation (⊥⊥_G) implying probabilistic separation (⊥⊥_P) [31]. Estimation of a DAG from data is difficult and computationally non-trivial due to the enormous size of the space of DAGs [32].

The graphical separation property is akin to the concept of the "Markov blanket". Let V be a set of random variables, P be their joint probability distribution, and X ∈ V; then a Markov blanket M of X is any set of variables such that X is conditionally independent of all the other variables given M [33]. Mathematically, this is expressed as:

X ⊥⊥_P V \ (M ∪ X) | M    (3.1)

Any directed arc from a variable A to a variable C indicates a direct stochastic dependency. Thus the lack of any arc connecting two nodes implies that these nodes are either i) marginally independent or ii) conditionally independent given a subset of the rest [34].
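In a DAG, the Markov blanket of a node consists of its parents, its children, and its children's other parents. A minimal sketch (the function name and example arcs are hypothetical):

```python
def markov_blanket(arcs, x):
    """Parents, children, and children's other parents (spouses) of node x."""
    parents, children = {}, {}
    for p, c in arcs:
        parents.setdefault(c, set()).add(p)
        children.setdefault(p, set()).add(c)
    blanket = set(parents.get(x, set())) | set(children.get(x, set()))
    for child in children.get(x, set()):
        blanket |= parents.get(child, set())  # spouses share a child with x
    blanket.discard(x)
    return blanket
```

For the arcs A→X, X→C, B→C, C→D, the blanket of X is {A, C, B}: its parent A, its child C, and the spouse B; D is screened off by C, as equation 3.1 requires.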

Bayesian Networks (BNs) are defined as a class of graphical models composed of a set of random variables X = {X_i, i = 1, 2, ..., m} and a directed acyclic graph (DAG), denoted G = (V, E), in which each node v_i ∈ V corresponds to a random variable X_i [34].

    Figure 3.2: DAG and CPDAG, adapted from [31]

A particular class of graph is the Completed Partially Directed Acyclic Graph (CPDAG): a partially directed graph which may have both directed and undirected edges [35]. The CPDAG of a (partially)


directed acyclic graph is the partially directed graph built on the same set of nodes, keeping the same v-structures and the same skeleton, completed with the compelled arcs [30]. Two DAGs having the same CPDAG are equivalent in the sense that they result in Bayesian networks describing the same probability distribution [30].

    3.2 Learning a Bayesian Network

The task of fitting a Bayesian network is called learning, and this has two steps: i) learning the structure: learning the graph, which accounts for the conditional independences present in the data; ii) learning the parameters: learning the parameters of the local and global distributions implied by the graph structure learnt in the previous step.

For a given dataset D, a global distribution of X, the DAG represented by G, and Θ as its parameters, the learning is given as equation 3.2, from [36]:

P(G, Θ | D) = P(G | D) · P(Θ | G, D)    (3.2)

where the left-hand side is the learning problem, the factor P(G | D) is the structural learning step, and the factor P(Θ | G, D) is the parameter learning step.

From [36], one can see that the joint probability distribution of X with parameters Θ can be decomposed into one local distribution for each X_i, conditional on its parents (Parents(X_i)). Mathematically, this can be expressed as the following:

P(X | G, Θ) = ∏_{i=1}^{N} P(X_i | Parents(X_i); Θ_{X_i})    (3.3)
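The factorization of equation 3.3 can be illustrated on a tiny discrete network. The network A → C ← B, its conditional probability tables, and all names here are hypothetical, chosen only to show that the joint probability is the product of the local distributions:

```python
def joint_prob(assign, cpts, parents):
    """P(X = assign) as the product over nodes of P(X_i | Parents(X_i))."""
    p = 1.0
    for var, table in cpts.items():
        key = tuple(assign[q] for q in parents[var])
        p *= table[key][assign[var]]
    return p

# Hypothetical two-parent binary network A -> C <- B.
parents = {"A": (), "B": (), "C": ("A", "B")}
cpts = {
    "A": {(): {0: 0.6, 1: 0.4}},
    "B": {(): {0: 0.7, 1: 0.3}},
    "C": {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
          (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.2, 1: 0.8}},
}
```

For example, P(A=1, B=0, C=1) = P(A=1) · P(B=0) · P(C=1 | A=1, B=0) = 0.4 · 0.7 · 0.6 = 0.168, and the probabilities over all eight assignments sum to one.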

    3.2.1 Structure Learning

From [36], one can see that structural learning consists of finding the DAG G that encodes the dependencies exhibited by the data, thereby maximizing P(G | D). Thus, the structural learning part of equation 3.2 leads to the following:

P(G, D) = P(G) · P(D | G) = P(G) ∫ P(D | G, Θ) P(Θ | G) dΘ    (3.4)

P(D | G) ∝ ∫ P(D | G, Θ) P(Θ | G) dΘ = ∏_{i=1}^{N} [ ∫ P(X_i | Parents(X_i), Θ_{X_i}) P(Θ_{X_i} | Parents(X_i)) dΘ_{X_i} ]    (3.5)

In structural learning, one can either: i) opt to rely on experts with domain knowledge, in the form of a white-list (arcs that must be included in the network) or a black-list (arcs that must never be present in the network), or ii) use the available data and perform a conditional independence test for each arc. In [31], it was mentioned that in a Bayesian Network (BN) there exists a hierarchy of variables: the variables that are thought of as "causes" are placed above the variables that are deemed "effects", and the "confounding" variables are present at the top of the network.


    3.2.2 Parameter Learning

From [37], one can see that if one assumes the parameters of the local distributions to be independent, then the parameter part of equation 3.2 becomes the following:

P(Θ | G, D) = ∏_{i=1}^{N} P(Θ_{X_i} | Parents(X_i), D)    (3.6)

After structural learning, the problem of estimating the parameters of the global distribution is broken into estimating the parameters of the local distributions. Of the many methods available to estimate the parameters, Bayesian posterior estimators were used, since [38] hinted that the posterior can be represented in a compact factorized form and the computation can be sped up by using Bayes' theorem.
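For discrete nodes, a Bayesian posterior estimate of a local distribution amounts to smoothing the observed counts with Dirichlet pseudo-counts and taking the posterior mean. A minimal sketch under that assumption (the function name, the uniform prior `alpha`, and the count layout are illustrative, not the thesis implementation):

```python
def posterior_mean_cpt(counts, alpha=1.0):
    """Posterior-mean estimate of P(X | parents) from raw counts,
    using a uniform Dirichlet prior with pseudo-count alpha per cell."""
    est = {}
    for parent_cfg, cell in counts.items():
        total = sum(cell.values()) + alpha * len(cell)
        est[parent_cfg] = {x: (n + alpha) / total for x, n in cell.items()}
    return est
```

With counts {0: 8, 1: 2} for one parent configuration and alpha = 1, the estimate of P(X = 1) is (2 + 1) / (10 + 2) = 0.25 rather than the maximum-likelihood 0.2, which keeps rare cells away from probability zero.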

    3.3 Bayesian Network Algorithms

Having seen the two types of learning that occur in Bayesian networks, it is now time to learn about the various algorithms that can be used to build/learn a Bayesian network. There are broadly three types of algorithms: constraint-based algorithms, score-based algorithms, and hybrid algorithms.

    3.3.1 Constraint-based algorithms

Constraint-based algorithms are largely based on the work of Pearl and Verma [39][40]. The core principle here is that the model learns a DAG using conditional independence tests. The commonly used tests for conditional independence are the 'mutual information test' and the 'exact Student's t-test' for discrete and continuous BNs (Bayesian Networks), respectively [34].

Some examples of constraint-based algorithms are: PC, Grow-Shrink (GS), Max-Min Parents & Children (MMPC), etc. From [34], a template for constraint-based structure learning algorithms is given below:


Algorithm 1 A template for constraint-based structure learning algorithms
Input: a dataset containing the variables X_i; i = 1, 2, ..., m
Output: a completed partially directed acyclic graph

(1) Phase 1: learning Markov blankets (optional)

    (a) For each variable X_i, learn its Markov blanket B(X_i)

    (b) Check whether the Markov blankets B(X_i) are symmetric, e.g. X_i ∈ B(X_j) ⇔ X_j ∈ B(X_i). Assume that nodes for which symmetry does not hold are false positives and drop them from each other's Markov blankets.

(2) Phase 2: learning neighbours

    (a) For each variable X_i, learn the set N(X_i) of its neighbours (i.e., the parents and the children of X_i). Equivalently, for each pair X_i, X_j, i ≠ j, search for a set S_{X_i,X_j} ⊂ V (including S_{X_i,X_j} = ∅) such that X_i and X_j are independent given S_{X_i,X_j} and X_i, X_j ∉ S_{X_i,X_j}. If there is no such set, place an undirected arc between X_i and X_j (X_i − X_j). If B(X_i) and B(X_j) are available from Phase 1 (a) and (b), the search for S_{X_i,X_j} can be limited to the smallest of B(X_i)\{X_j} and B(X_j)\{X_i}

    (b) Check whether the N(X_i) are symmetric, and correct asymmetries as in step Phase 1 (b)

(3) Phase 3: learning arc directions

    (a) For each pair of non-adjacent variables X_i and X_j with a common neighbour X_k, check whether X_k ∈ S_{X_i,X_j}. If not, set the direction of the arcs X_i − X_k and X_k − X_j to X_i → X_k and X_k ← X_j to obtain a v-structure V_l = X_i → X_k ← X_j

    (b) Set the direction of the arcs that are still undirected by applying the following two rules recursively:

        (i) If X_i is adjacent to X_j and there is a strictly directed path from X_i to X_j (a path leading from X_i to X_j containing no undirected arcs), then set the direction of X_i − X_j to X_i → X_j

        (ii) If X_i and X_j are not adjacent but X_i → X_k and X_k − X_j, then change the latter to X_k → X_j

    3.3.2 Score-based algorithms

Score-based algorithms are heuristic optimization techniques: these models try to optimize some score (e.g. BIC, mutual information, log-likelihood ratio) for a given DAG. Score-based algorithms tend to produce models with higher likelihood compared to constraint-based and hybrid algorithms, and produce the largest networks, allowing good propagation of evidence [37].

Most scores have tuning parameters, whereas conditional independence tests (mostly) do not. Thus, as mentioned in [41], an effective method to select the optimal learning parameters for score-based algorithms is to perform a grid search.

Some examples of score-based algorithms are Hill Climbing and Tabu Search. The Hill Climbing algorithm is as follows [30]:


    Algorithm 2 Hill Climbing Algorithm

(1) Choose a network structure G over V, usually (but not necessarily) empty

(2) Compute the score of G, denoted as Score_G = Score(G)

(3) Set maxscore = Score_G

(4) Repeat the following steps as long as maxscore increases:

    (a) for every possible arc addition, deletion or reversal not resulting in a cyclic network:

        (i) compute the score of the modified network G*, Score_{G*} = Score(G*)
        (ii) if Score_{G*} > Score_G, set G = G* and Score_G = Score_{G*}

    (b) update maxscore with the new value of Score_G

(5) Return the DAG G
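The pseudocode above can be sketched as a minimal score-based learner. This is an illustrative reduction, not the thesis implementation: it only tries arc additions (no deletions, reversals, or random restarts), assumes binary variables, and scores candidates with a decomposable BIC in the spirit of equation 3.21.

```python
import math

def bic_local(data, child, parents):
    """Local BIC term for one node: log-likelihood of its conditional
    distribution minus (parameters / 2) * log n."""
    n = len(data)
    counts = {}
    for row in data:
        key = tuple(row[p] for p in sorted(parents))
        counts.setdefault(key, {}).setdefault(row[child], 0)
        counts[key][row[child]] += 1
    ll = 0.0
    for cell in counts.values():
        tot = sum(cell.values())
        for c in cell.values():
            ll += c * math.log(c / tot)
    n_params = (2 - 1) * (2 ** len(parents))  # binary variables assumed
    return ll - 0.5 * n_params * math.log(n)

def bic(data, dag):
    """BIC decomposes over nodes; dag maps node -> set of parents."""
    return sum(bic_local(data, v, ps) for v, ps in dag.items())

def creates_cycle(dag, frm, to):
    """Would adding frm -> to close a directed cycle (is frm reachable from to)?"""
    stack, seen = [to], set()
    while stack:
        v = stack.pop()
        if v == frm:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(c for c, ps in dag.items() if v in ps)
    return False

def hill_climb(data, variables):
    dag = {v: set() for v in variables}  # step (1): start from the empty graph
    best = bic(data, dag)
    improved = True
    while improved:                      # step (4): loop while the score improves
        improved = False
        for a in variables:
            for b in variables:
                if a == b or a in dag[b] or creates_cycle(dag, a, b):
                    continue
                dag[b].add(a)            # try the addition a -> b
                score = bic(data, dag)
                if score > best + 1e-9:
                    best, improved = score, True
                else:
                    dag[b].remove(a)     # revert if the score did not improve
    return dag, best
```

On data where B copies A and C varies independently, the learner adds exactly one arc, A → B: the likelihood gain on B outweighs the log n penalty, while arcs touching C only increase the penalty.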

    3.3.3 Hybrid algorithms

Hybrid algorithms combine constraint-based and score-based algorithms to offset their respective weaknesses and produce stable network structures [30]. These algorithms reduce the solution space of DAGs using conditional independence tests and use the score to zone in on an optimal DAG [30].

As mentioned in [31], constraint-based algorithms are usually faster, score-based algorithms are more stable, and hybrid algorithms are at least as good as score-based algorithms, and often a bit faster.

Some examples of hybrid algorithms are Max-Min Hill Climbing (MMHC) and General 2-Phase Restricted Maximization. From [42], the Max-Min Hill Climbing (MMHC) algorithm is as follows:


Algorithm 3 MMPC and MaxMinHeuristic Algorithm

procedure: MMPC(T, D)

(1) Input: target variable T; data D

(2) Output: the parents and children of T in any Bayesian network faithfully representing the data distribution

(3) Phase I: Forward

    (a) CPC = ∅
    (b) repeat
        (i) <F, assocF> = MaxMinHeuristic(T; CPC)
        (ii) if assocF ≠ 0 then
            (I) CPC = CPC ∪ {F}
        (iii) end if
    (c) until CPC has not changed

(4) Phase II: Backward

    (a) for all X ∈ CPC do
    (b) if ∃ S ⊆ CPC s.t. Ind(X; T | S) then
        (i) CPC = CPC \ {X}
        (ii) end if
    (c) end for

(5) return CPC

end procedure

procedure: MaxMinHeuristic(T, CPC)

(1) Input: target variable T; subset of variables CPC

(2) Output: the maximum over all variables of the minimum association with T relative to CPC, and the variable that achieves the maximum

(3) assocF = max_{X ∈ V} MinAssoc(X; T | CPC)

(4) F = argmax_{X ∈ V} MinAssoc(X; T | CPC)

(5) return <F, assocF>

end procedure


Algorithm 4 MMPC and MMHC Algorithm

procedure: MMPC(T, D)

(1) CPC = MMPC(T, D)

(2) for every variable X ∈ CPC do

    (a) if T ∉ MMPC(X, D) then
        (i) CPC = CPC \ {X}
        (ii) end if
    (b) end for

(3) return CPC

end procedure

procedure: MMHC(D)

(1) Input: data D

(2) Output: a DAG on the variables in D

(3) Restrict

    (a) for every variable X ∈ V do
        (i) PC_X = MMPC(X, D)
        (ii) end for

(4) Search

(5) Starting from an empty graph, perform greedy hill-climbing with operators add-edge, delete-edge, reverse-edge. Only try operator add-edge Y → X if Y ∈ PC_X

(6) Return the highest scoring DAG found

end procedure

    3.4 Conditional independence tests

The log-likelihood ratio is one of the conditional independence tests used in the structural learning of Bayesian networks [30]. Conditional independence tests focus on the presence or absence of individual arcs; since each arc/edge encodes a probabilistic dependence, conditional independence tests can be used to assess whether that probabilistic dependence is supported by the data [30]. Thus, if the null hypothesis stating conditional independence is rejected (e.g. by a chi-square test [42]), then the arc can be considered for inclusion in the network [30].

For a three-node network with nodes A, B and C in a DAG, the null hypothesis is that A is probabilistically independent (⊥⊥_P) of B given C; the log-likelihood ratio G² is then given by equation 3.9, from [30].

H_0 : A ⊥⊥_P B | C    (3.7)

H_1 : A ⊥⊥_P B | C does not hold    (3.8)


G²(A, B | C) = Σ_{a ∈ A} Σ_{b ∈ B} Σ_{c ∈ C} (n_abc / n) · log( (n_abc · n_++c) / (n_a+c · n_+bc) )    (3.9)

In equation 3.9 the categories of A are denoted by a ∈ A, the categories of B by b ∈ B, and the categories of C by c ∈ C. The number n_abc denotes the number of observations for the combination of a category a of A, a category b of B and a category c of C. A "+" subscript denotes the sum over an index and is used to indicate the marginal counts for the remaining variables. So, for example, n_a+c is the number of observations for a and c obtained by summing over all the categories of B, from [30].

A good example of the usage of conditional independence is given in [42]. The p-values from the χ² test are used to accept or reject the null hypothesis (at significance level α). For the variables where the null hypothesis cannot be rejected, a test measuring association is then performed (e.g. G²), where lower p-values imply higher association.
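The statistic of equation 3.9 can be computed directly from joint counts. A minimal sketch (the function name and count layout are illustrative; as in equation 3.9, the conventional scaling to a χ²-distributed statistic is omitted):

```python
import math

def g2(counts):
    """Log-likelihood ratio statistic for A independent of B given C,
    from joint counts keyed by (a, b, c), following equation 3.9."""
    n = sum(counts.values())
    n_ppc, n_apc, n_pbc = {}, {}, {}
    for (a, b, c), v in counts.items():
        n_ppc[c] = n_ppc.get(c, 0) + v            # n_++c
        n_apc[(a, c)] = n_apc.get((a, c), 0) + v  # n_a+c
        n_pbc[(b, c)] = n_pbc.get((b, c), 0) + v  # n_+bc
    stat = 0.0
    for (a, b, c), v in counts.items():
        if v > 0:
            stat += (v / n) * math.log(v * n_ppc[c] / (n_apc[(a, c)] * n_pbc[(b, c)]))
    return stat
```

When the table is perfectly independent given C, every log term is log 1 = 0 and the statistic vanishes; dependence pushes it above zero.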

    3.5 Inference on Bayesian Network

Suppose one has a Bayesian network B, with DAG G and parameters Θ. Evidence E is used to calculate the posterior distribution, i.e. P(X | E), and this remains the most powerful aspect of Bayesian networks [43]. For example, if the random variables of the network are X, Y and Z, and Z takes on the value z in evidence E1, then E1 = (?, ?, z) and P(X | E1) = P(X | Z = z), from [43].

    From [30], the posterior probability is as:

P(X | E, B) = P(X | E, G, Θ)    (3.10)

Queries on the Bayesian network are termed "conditional probability queries (CPQ)".

Conditional probability queries are concerned with the distribution of a subset of variables Q = {X_j1, ..., X_jl} given some evidence E on another set {X_i1, ..., X_ik} of variables in X [30]. The two sets of variables can be assumed to be disjoint; the distribution of interest is given by equation 3.11, while the marginal posterior probability distribution is given by equation 3.12, from [30].

CPQ(Q | E, B) = P(Q | E, G, Θ) = P(X_j1, ..., X_jl | E, G, Θ)    (3.11)

P(Q | E, G, Θ) = ∫ P(X | E, G, Θ) d(X \ Q)    (3.12)

The process of propagating the effects of evidence is called belief propagation: the belief in X under Bayesian network B is updated using the evidence E, as given by equation 3.13, from [30].

    Thus 3.12 becomes:

P(Q | E, G, Θ) = ∏_{i: X_i ∈ Q} ∫ P(X_i | E, Parents(X_i), Θ_{X_i}) dX_i    (3.13)

There are two types of algorithms for belief propagation: 'exact inference' and 'approximate inference'.

    3.5.1 Exact Inference

Exact inference algorithms combine repeated applications of Bayes' theorem with local computations to obtain exact values of P(Q | E, G, Θ). Due to their nature, such algorithms are feasible only for small or very simple networks, such as trees and polytrees [30]. In the worst case, their computational complexity is exponential in the number of variables [30].


    3.5.2 Approximate Inference

Approximate inference algorithms use Monte Carlo simulations to sample from the global distribution of X and thus estimate P(Q | E, G, Θ). In particular, they generate a large number of samples from B and estimate the relevant conditional probabilities by weighting the samples that include both E and Q = q against those that include only E [30]. In computer science, these random samples are often called particles, and the algorithms that make use of them are known as particle filters or particle-based methods [30].

There exist two techniques for approximate inference, namely 'logic sampling' and 'likelihood sampling'. From [30], logic sampling combines rejection sampling and uniform weights, essentially counting the proportion of generated samples including E that also include Q = q. However, this method has a flaw: a high proportion of samples is rejected if the evidence E is a rare occurrence [30].

This flaw of rejecting a high number of samples is fixed using 'likelihood weighting', where the generated samples include the evidence E by design. However, this means that one is no longer sampling from the original Bayesian network, but from a second Bayesian network in which all the nodes X_i1, ..., X_ik in E are fixed. This network is called the mutilated network [30].
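Likelihood weighting can be sketched on the smallest possible network, A → B: evidence B = 1 is fixed by design, A is sampled from its prior, and each sample carries the likelihood of the evidence as a weight. The network, parameter names, and function are hypothetical illustrations, not the thesis code.

```python
import random

def likelihood_weighting(n_samples, p_a, p_b_given_a, evidence_b=1, seed=7):
    """Estimate P(A = 1 | B = evidence_b) in the two-node network A -> B.
    The evidence node is never sampled; each draw of A is weighted by
    P(B = evidence_b | A), so no sample is ever rejected."""
    random.seed(seed)
    num = den = 0.0
    for _ in range(n_samples):
        a = 1 if random.random() < p_a else 0
        w = p_b_given_a[a] if evidence_b == 1 else 1 - p_b_given_a[a]
        den += w
        if a == 1:
            num += w
    return num / den
```

With P(A=1) = 0.5 and P(B=1 | A) = 0.8 or 0.2, Bayes' theorem gives the exact posterior P(A=1 | B=1) = 0.8, which the weighted estimate approaches as the number of particles grows.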

    3.6 Logistic Regression

Logistic regression is a technique to model the probability of an observation belonging to a specific class; mathematically, "the expected value of Y, given the value(s) of X" [44] can be expressed as the following:

E(y_i | x_i) = S(β_0 + β_1 · x_i)    (3.14)

where

S(t) = 1 / (1 + exp(−t))    (3.15)

The β in equation 3.14 is a vector of coefficients, which are the parameters of the model, generally estimated using maximum likelihood estimation (MLE). In the logistic model, the output variable y_i is a Bernoulli random variable, and the probability of y_i = 1 is given by:

P(y_i = 1 | x_i) = S(x_i β)    (3.16)

    3.6.1 Log-likelihood

The likelihood of a single observation (x_i, y_i) is given by equation 3.17:

L(β; y_i, x_i) = [S(x_i β)]^{y_i} [1 − S(x_i β)]^{1−y_i}    (3.17)

The log-likelihood of the logistic model with N observations, input feature matrix X and output vector y is given by 3.18:

l(β; y, X) = Σ_{i=1}^{N} [ −ln(1 + exp(x_i β)) + y_i x_i β ]    (3.18)

    Proof of equation 3.18:

l(β; y, X) = ln(L(β; y, X))

= ln( ∏_{i=1}^{N} [S(x_i β)]^{y_i} [1 − S(x_i β)]^{1−y_i} )

= Σ_{i=1}^{N} [ y_i ln(S(x_i β)) + (1 − y_i) ln(1 − S(x_i β)) ]

= Σ_{i=1}^{N} [ y_i ln( 1 / (1 + exp(−x_i β)) ) + (1 − y_i) ln( 1 − 1 / (1 + exp(−x_i β)) ) ]

= Σ_{i=1}^{N} [ y_i ln( 1 / (1 + exp(−x_i β)) ) + (1 − y_i) ln( exp(−x_i β) / (1 + exp(−x_i β)) ) ]

= Σ_{i=1}^{N} [ ln( exp(−x_i β) / (1 + exp(−x_i β)) ) + y_i ( ln( 1 / (1 + exp(−x_i β)) ) − ln( exp(−x_i β) / (1 + exp(−x_i β)) ) ) ]

= Σ_{i=1}^{N} [ ln( 1 / (1 + exp(x_i β)) ) + y_i ln( 1 / exp(−x_i β) ) ]

= Σ_{i=1}^{N} [ −ln(1 + exp(x_i β)) + y_i x_i β ]
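The equality derived above can be checked numerically: the sum of Bernoulli log-densities and the simplified form of equation 3.18 agree for any data. A minimal sketch with a single scalar feature and no intercept (an illustrative simplification):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loglik_bernoulli(beta, X, y):
    """Left-hand side of the derivation: sum of Bernoulli log-densities."""
    ll = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(xi * beta)
        ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return ll

def loglik_compact(beta, X, y):
    """Final line of the derivation: sum of -ln(1 + e^{x b}) + y x b."""
    return sum(-math.log(1 + math.exp(xi * beta)) + yi * xi * beta
               for xi, yi in zip(X, y))
```

Both functions return the same value up to floating-point error, which is exactly what the eight-step derivation proves algebraically.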

    3.6.2 The Hessian

The matrix of second derivatives of the log-likelihood with respect to the parameters β is called the Hessian of the log-likelihood. This is given by 3.19:

∇_{ββ} l(β; y, X) = − Σ_{i=1}^{N} x_i^T x_i S(x_i β)[1 − S(x_i β)]    (3.19)

    Proof of equation 3.19:

∇_{ββ} l(β; y, X) = ∇_β( ∇_β l(β; y, X) )

= ∇_β( Σ_{i=1}^{N} [y_i − S(x_i β)] x_i )

= − Σ_{i=1}^{N} x_i ∇_β S(x_i β)

= − Σ_{i=1}^{N} x_i^T x_i S(x_i β)[1 − S(x_i β)]

In [45], it was shown using the central limit theorem that the distribution of the parameters β can be approximated by a normal distribution with mean equal to the true parameter value (the MLE β̂) and covariance given by the inverse of the negative Hessian. Mathematically, this is the following:

β ~ N( β̂, [ Σ_{i=1}^{N} x_i^T x_i S(x_i β)[1 − S(x_i β)] ]^{−1} )    (3.20)

    3.7 Metrics for Evaluation of Model Performance

Having seen the two modelling techniques, it is time to focus on the metrics that allow us to compare and evaluate the predictive performance of the models.

    3.7.1 Bayesian Information Criterion (BIC)

One of the scores used by score-based algorithms is the Bayesian Information Criterion (BIC), which provides a goodness-of-fit metric, from [37]:

BIC(G) = Σ_{i=1}^{N} [ log P(X_i | Parents(X_i)) − (|Θ_{X_i}| / 2) · log(n) ]    (3.21)


    3.7.2 F1 score

    Table 3.1: Confusion matrix for a binary class problem

                  Predicted positive    Predicted negative
Positive class    True Positive (TP)    False Negative (FN)
Negative class    False Positive (FP)   True Negative (TN)

The harmonic mean of precision and recall is termed the "F1 score".

precision = true positives / (true positives + false positives)    (3.22)

recall = true positives / (true positives + false negatives)    (3.23)

F1 = 2 · precision · recall / (precision + recall)    (3.24)

    3.7.3 Accuracy

Accuracy is the fraction of cases for which the model is correct [46] and is given by the following:

Accuracy = (true positives + true negatives) / (true positives + true negatives + false positives + false negatives)    (3.25)

    3.7.4 Positive Class Accuracy

Positive class accuracy, or positive predictive value, is defined as:

Positive Class Accuracy = true positives / (true positives + false positives)    (3.26)

    3.7.5 Balanced Accuracy

Balanced accuracy is the arithmetic mean of 'recall (sensitivity)' and 'specificity'. Balanced accuracy is a good metric to use in the case of an imbalanced dataset [47] and is given by equation 3.29.

Recall (Sensitivity) = true positives / (true positives + false negatives)    (3.27)

Specificity = true negatives / (true negatives + false positives)    (3.28)

Balanced Accuracy = ( true positives / (true positives + false negatives) + true negatives / (true negatives + false positives) ) / 2    (3.29)
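Equations 3.22–3.29 can be collected into one helper that derives every metric from the four cells of the confusion matrix. The function name is illustrative:

```python
def metrics(tp, fn, fp, tn):
    """All evaluation metrics of section 3.7 from one confusion matrix."""
    precision = tp / (tp + fp)           # equation 3.22
    recall = tp / (tp + fn)              # equations 3.23 / 3.27 (sensitivity)
    specificity = tn / (tn + fp)         # equation 3.28
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),          # 3.24
        "accuracy": (tp + tn) / (tp + tn + fp + fn),                  # 3.25
        "balanced_accuracy": (recall + specificity) / 2,              # 3.29
    }
```

For tp=40, fn=10, fp=20, tn=30 this gives accuracy 0.70 and balanced accuracy 0.70, while precision drops to 2/3 because of the 20 false positives, illustrating why no single metric suffices on imbalanced data.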

    3.8 Techniques for Measuring Association

In modelling it is often important to understand which variables/columns are associated with the outcome variable/column (variable 'conv'). Three techniques are explored and used in the project: i) the chi-square test, ii) brute force search, and iii) multiple correspondence analysis.


    3.8.1 Chi-Square Test (χ2 test)

The chi-square (χ²) test can be performed on a 2 × 2 contingency table and can serve as a simple measure of association between the other variables and the outcome variable (variable 'conv'). One of the advantages of this test is that it is non-parametric and thus robust with respect to the distribution of the data [48].

Let X_1, X_2, ..., X_n be independent samples from some distribution, such that

P(X_ij = 1) = 1 − P(X_ij = 0) = p_j, for all 1 ≤ j ≤ k    (3.30)

Each X_i consists of exactly k − 1 zeros and a single one, where the one is in the component of the success category at trial i. The implication of equation 3.30 is that Var(X_ij) = p_j · (1 − p_j), from [49].

The Pearson chi-square statistic is given by:

χ² = Σ_{j=1}^{k} (N_j − n · p_j)² / (n · p_j)    (3.31)

χ² is Pearson's cumulative test statistic, which asymptotically follows a χ² distribution when there is no association between the two variables [49].
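As a worked illustration of equation 3.31 specialized to a 2 × 2 contingency table, the statistic compares observed cells against expected counts e_ij = (row_i · col_j) / n. The function name and table are hypothetical:

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table,
    summing (observed - expected)^2 / expected over the four cells."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat
```

For the table [[10, 20], [30, 40]] this matches the 2 × 2 shortcut n(ad − bc)² / (r1 · r2 · c1 · c2) ≈ 0.794; the resulting statistic would then be compared against a χ² distribution with one degree of freedom.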

    3.8.2 Brute Force Search

For a given dataset and all its features, there exists an optimal solution that models the outcome variable as a function of the other variables. This exercise can be perceived as a search problem with the objective of finding the model with the maximum accuracy/minimum BIC, with a guaranteed globally optimal solution: brute force search. The biggest issue here is the time taken for the search, O(n^k) [50]. However, from [50], any possibility of reducing the solution space reduces the search time to a more reasonable level.

    3.8.3 Multiple Correspondence Analysis

Multiple correspondence analysis (MCA) aims at analyzing the structure of relationships existing in a set of categorical variables, by explaining the associations through their projection into a space with a reduced number of dimensions, almost always two [51]. MCA is an extension of correspondence analysis (CA), which allows one to analyze the pattern of relationships of several categorical dependent variables. MCA is largely based on principal component analysis, with the difference that the variables are categorical instead of quantitative [52]. CA and MCA are both special cases of weighted principal component analysis [53]. Furthermore, MCA can be obtained by applying correspondence analysis to an indicator matrix, although the interpretation of correspondence analysis then needs to be adapted.

Multiple correspondence analysis also has the attractive property of optimality of the scale values, with some adjustment (thanks to achieving maximum intercorrelation and thus maximum reliability in terms of Cronbach's alpha) [53].


    Figure 3.3: MCA notation, source[54]

Consider the following: if there are I observations and K nominal variables with J_k levels each, such that Σ J_k = J, then the indicator matrix is denoted by X. Performing correspondence analysis on the matrix X provides two sets of factor scores, one for the rows and one for the columns (see figure 3.4).

    Figure 3.4: Point cloud of categories/columns, source [54]

    The variance of a category K is given by equation 3.32.

Var(k) = d²(k, O) = Σ_{i=1}^{I} (1/I) x_ik² = Σ_{i=1}^{I} (1/I) (y_ik / p_k − 1)² = 1/p_k − 1    (3.32)

The implication of 3.32 is that as a category k becomes rarer, the distance between the point M_k representing category k and the origin O becomes larger; thus in MCA, uncommon categories are located away from the origin [55].

    The distance between pairs of categories is given by equation 3.33.

d²(k, k′) = Σ_{i=1}^{I} (1/I) (y_ik / p_k − y_ik′ / p_k′)² = (p_k + p_k′ − 2 p_kk′) / (p_k p_k′)    (3.33)

where p_kk′ denotes the proportion of individuals presenting both category k and category k′.

    22

  • 3.8. Techniques for Measuring Association

The implication of 3.33 concerns overlapping individuals: if the proportion of individuals is highly skewed towards one category, this leads to a larger distance between the two categories [55].

The inertia of the k-th category is given by equation 3.34:

Inertia(k) = (p_k / J) · d²(k, O) = (1 − p_k) / J    (3.34)

    The implication of 3.34 is that MCA identifies rare categories but does not exaggerate theinfluence of sporadic ones [55].


4 Methods

This chapter provides an overview of the steps followed in the project. The aim of this chapter is to facilitate replicability, conforming to the expectations of a scientific report.

    4.1 Baseline Logistic Regression Model

To compare the viability of a Bayesian network as a market attribution model, a logistic model was created as a benchmark. Two constraints were placed on the model:

• No variable should be dropped from the model, as the end-users of the model are interested in attribution values for all touch-points

• No interaction terms should be present, since these would result in multiple combinations, often tricky to evaluate.

    4.1.1 Optimal Decision Threshold of Logistic Regression

Cross-validation was used to find the optimal decision threshold/cutoff (the probability at which a sample is classified as 1 in binary 0/1 data), where model metrics such as F1 score, accuracy, balanced accuracy, etc. were plotted against the cutoff. The point at which the balanced accuracy (as seen in figure 4.1) is maximal is deemed the optimal decision threshold/cutoff. Furthermore, overfitting was checked visually in figure 4.2.


    Figure 4.1: Plot showing the model metrics vs. the decision boundary cutoff

    Figure 4.2: Testing for over-fitting at different decision boundary cutoff

    4.2 Bayesian Network Model

Having created our benchmark model and tuned it for the best possible performance, the next step is the design and validation of the Bayesian network. To create a Bayesian network, the following choices have to be made:


    • White-list - The nodes which should always be connected by an edge

    • Black-list - The nodes which should not be connected by an edge

• Number of restarts – the number of random restarts the algorithm runs for to converge at an optimum (hopefully global)

    • Algorithm - The algorithm to be used for building the Bayesian Network

    4.2.1 White-list Creation Using Hypothesis Testing

To understand the given dataset and its interactions with 'conversion', the formulated hypotheses were tested using the chi-square test (see section 3.8.1). These hypotheses were validated by the supervisor at Nepa (Viking Fristrom).

    The following is the result of the hypotheses testing:

Table 4.1: Hypothesis table with p-values, tested at the 0.05 significance level

Variable | Hypothesis | P-value | Result
ads_on_radio_streaming | Creates brand awareness but does not immediately lead to purchase | 0.525237381 | Hypothesis rejected
ads_on_tv | Creates brand awareness but less than radio, does not immediately lead to purchase | 0.00049975 | Failed to reject
banner_ads_online_not_social | Most likely to get blocked, but creates brand awareness | 0.00049975 | Failed to reject
brand_website | A customer actively checking a brand's website may become a long-term customer | 0.00149925 | Failed to reject
friend_family_recommendation | Depending on how much we value them, a recommendation may lead to purchase | 0.00049975 | Failed to reject
i_saw_an_offer_promotion | Works for budget shoppers, may lead to a few purchases | 0.00049975 | Failed to reject
i_saw_something_new | Very similar to i_saw_heard_something_else_in_store; unlikely to lead to purchase since this largely depends on the type of person | 0.00049975 | Failed to reject
instore_research | People who research products are already invested in the decision to purchase | 0.00049975 | Failed to reject
magazine_or_newspaper_ads | Similar to saw_a_product_display, creates brand awareness but does not immediately lead to purchase | 0.051974013 | Hypothesis rejected
online_retailer_research | This may still be window shopping but a bit more likely to purchase | 0.270364818 | Hypothesis rejected
online_retailer_visit | People are mostly up to window shopping but sometimes purchase | 0.053473263 | Hypothesis rejected
online_video_ad | Would create brand awareness but might lead to being annoyed | 0.011494253 | Failed to reject
outdoor_ads | Creates brand awareness but does not immediately lead to purchase | 0.781109445 | Hypothesis rejected
previous_shopping_list | Things present on a previous shopping list are part of the routine | 0.24037981 | Hypothesis rejected
promo_coupon_leaflet_from_retailer | Works for budget shoppers, may lead to a few purchases | 0.050474763 | Hypothesis rejected
promo_coupon_leaflet_not_from_retailer | Works for budget shoppers, may lead to a few purchases | 0.009995002 | Failed to reject
recipe_site | If a recipe calls for an item, one is highly likely to purchase it | 0.888055972 | Hypothesis rejected
researched_on_search_engine | Similar to instore_research, highly likely to make the purchase | 0.020989505 | Failed to reject
saw_a_product_display | Creates brand awareness but does not immediately lead to purchase | 0.00049975 | Failed to reject
saw_a_sign_poster | Creates brand awareness but does not immediately lead to purchase | 0.00049975 | Failed to reject
search_engine_ads | Same as a billboard, but people use ad blockers and this may only lead to brand awareness | 0.020489755 | Failed to reject
social_media | Social media is tricky and anything from a simple ad can lead to a loyal customer, thus a weak expectation | 0.864067966 | Hypothesis rejected
there_was_a_seasonal_event_or_occasion | During a seasonal event, one is highly likely to purchase the corresponding item (gul during Christmas) | 0.033983008 | Failed to reject
male | Women tend to respond better to in-store touch-points such as coupons | 0.284857571 | Hypothesis rejected

The variables that showed association with 'conversion' in a chi-square test at the 0.05 significance level were considered candidate variables for the white-list. This test has its flaws, however, in that it assumes there are no confounding variables (variables that influence both the dependent and the independent variable [56]). These variables may nevertheless prove to be a good starting point for the white-list.
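The screening step above can be sketched as follows. This is an illustrative Python sketch on synthetic binary data (the thesis pipeline itself was built in R); the variable names and effect size are assumptions for the example.

```python
# Hedged sketch: screening a touchpoint variable against 'conversion' with a
# chi-square test of independence. Data is synthetic and illustrative.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 2000
# Toy binary data: 'ads_on_tv' is made predictive of conversion on purpose.
ads_on_tv = rng.integers(0, 2, n)
conversion = (rng.random(n) < 0.3 + 0.2 * ads_on_tv).astype(int)

# Build the 2x2 contingency table: rows = touchpoint, columns = conversion.
table = np.zeros((2, 2), dtype=int)
for x, y in zip(ads_on_tv, conversion):
    table[x, y] += 1

chi2, p_value, dof, _ = chi2_contingency(table)
candidate_whitelist = p_value < 0.05  # significant association -> candidate
print(f"p-value = {p_value:.4f}, white-list candidate: {candidate_whitelist}")
```

In the real pipeline this test is repeated for each of the 24 touchpoint variables, and the significant ones become white-list candidates.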

    4.2.2 White-list Creation Using Multiple Correspondence Analysis

Using Multiple Correspondence Analysis (see 3.8.3) on the data, the variables that contributed most to the eigenvectors associated with conversion (conv == 1) were obtained. These variables will be evaluated as a white-list.
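The core MCA computation can be sketched as a correspondence analysis of the one-hot (disjunctive) indicator matrix. This is an illustrative Python sketch on synthetic data; a full analysis would use a dedicated MCA package, and the variable names here are assumptions.

```python
# Hedged sketch of MCA via SVD of the standardized indicator matrix.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "conv": rng.integers(0, 2, 500),
    "ads_on_tv": rng.integers(0, 2, 500),
    "recipe_site": rng.integers(0, 2, 500),
}).astype(str)

Z = pd.get_dummies(df).to_numpy(float)      # indicator (disjunctive) matrix
P = Z / Z.sum()                             # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)         # row / column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
U, s, Vt = np.linalg.svd(S, full_matrices=False)

explained = s**2 / (s**2).sum()  # variance explained per eigenvector/axis
contrib = Vt.T**2                # column contributions; each axis sums to 1
print("variance explained:", np.round(explained[:3], 3))
```

The categories contributing most to the axes associated with conv == 1 are the white-list candidates, mirroring figures 4.4 and 4.5.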


4.2. Bayesian Network Model

    Figure 4.3: Plot showing the Variance Explained vs. Eigenvectors

    Figure 4.4: Plot showing the contribution of variables towards Eigenvector 1


    Figure 4.5: Plot showing the contribution of variables towards Eigenvector 2

    4.2.3 White-list Creation Using Grid Search

A brute-force technique to find the variables related to 'conversion' (to be used as a white-list) is to try all combinations of the variables present in the data. However, this would be computationally too expensive, as mentioned in [50], since there are

\[
\sum_{i=1}^{25} \binom{25}{i} = 2^{25} - 1 = 33{,}554{,}431
\]

combinations. Leveraging some domain knowledge, my supervisor (Anna Lundmark) at Nepa reduced this space to about 55,455 combinations.

Plots 4.6 and 4.7 show Bayesian networks built with different combinations of variables as white-list, using the hill-climbing algorithm (considered stable compared to constraint-based algorithms [31]), and their corresponding performance in terms of BIC score and balanced accuracy. The combination of variables with the best performance (highest balanced accuracy) was selected.
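The search over combinations can be sketched as follows. This is an illustrative Python sketch: `score_whitelist` is a hypothetical stand-in for fitting a hill-climbing network with the given white-list and returning its balanced accuracy, and the four variable names are assumptions chosen to keep the enumeration small.

```python
# Hedged sketch of the brute-force white-list search over variable subsets.
from itertools import combinations

variables = ["ads_on_tv", "recipe_site", "instore_research", "social_media"]

def score_whitelist(subset):
    # Placeholder: in the real pipeline this fits a Bayesian network with
    # `subset` as white-list and returns balanced accuracy on held-out data.
    return len(set(subset) & {"instore_research", "ads_on_tv"}) / 2

# All non-empty subsets: 2^4 - 1 = 15 here, 2^25 - 1 for the full problem.
candidates = [c for r in range(1, len(variables) + 1)
              for c in combinations(variables, r)]
print(f"{len(candidates)} combinations for {len(variables)} variables")

best = max(candidates, key=score_whitelist)
print("best white-list:", best)
```

With 25 variables this enumeration explodes to the 33.5 million combinations noted above, which is why domain knowledge was used to prune the space.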


    Figure 4.6: BIC vs. Iteration using Grid Search

    Figure 4.7: Balanced Accuracy vs. Iteration using Grid Search

    4.2.4 Black-list Creation Using Domain Knowledge

Unlike the white-list, there is no data-driven method to create the black-list, since it is largely based on domain knowledge. Thus, with the help of my supervisor (Anna Lundmark), 132 black-list rules were created.


    4.2.5 Optimal Restart Using Cross-Validation

As mentioned in [57], random restarts are a better method than a grid search for hyperparameter tuning. However, the optimal number of restarts can itself be determined using a grid search and cross-validation. Figure 4.8 shows the effect of the number of restarts on balanced accuracy.

    Figure 4.8: Balanced Accuracy vs. Random Restarts
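The cross-validated selection of the restart count can be sketched as follows. This is an illustrative Python sketch: `fit_and_score` is a hypothetical stand-in for running hill-climbing with a given number of random restarts and scoring balanced accuracy on the held-out fold; its diminishing-returns shape is an assumption for the example.

```python
# Hedged sketch: choosing the number of random restarts via k-fold CV.
import numpy as np

rng = np.random.default_rng(2)
data = np.arange(100)  # stand-in for the survey rows

def fit_and_score(train_idx, test_idx, restarts):
    # Placeholder scoring: more restarts help, with diminishing returns.
    return 0.55 + 0.05 * (1 - 1 / (1 + restarts)) + rng.normal(0, 0.001)

k = 5
folds = np.array_split(rng.permutation(len(data)), k)
results = {}
for restarts in [0, 5, 10, 20, 50]:
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(fit_and_score(train_idx, test_idx, restarts))
    results[restarts] = float(np.mean(scores))

best_restarts = max(results, key=results.get)
print("mean balanced accuracy by restarts:", results)
```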

    4.3 Bayesian Network Model Comparison

The techniques described in sections 4.2.1, 4.2.2, and 4.2.3 provide different white-lists as starting points for the Bayesian network models. Section 4.2.4 provided the black-list edges, allowing domain knowledge to be infused into the models. Furthermore, section 4.2.5 provided an optimal range for the number of restarts. What remains is to evaluate the different algorithms that can be used to learn from the data.

These models learn using different algorithms and will be benchmarked against the baseline model (see 4.1) and evaluated on metrics such as BIC, accuracy, and balanced accuracy.

    4.4 Attribution Formula

Two attribution formulas are used in this project. Using these formulas, the efficacy of the touchpoints/variables is measured. Equation 4.1 is used for the Bayesian network and is taken from [6] and [30]. Adapting equation 4.1 to the logistic model yields equation 4.2.

\[
\text{Attribution}(w) = E[\text{conversion} = 1 \mid w = 1] - E[\text{conversion} = 1 \mid w = 0] \tag{4.1}
\]

\[
\text{Attribution}(w) = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 w)]} - \frac{1}{1 + \exp[-\beta_0]} \tag{4.2}
\]
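Equation 4.2 can be verified with a small worked example. The coefficient values below are illustrative, not fitted values from the thesis.

```python
# Worked sketch of the logistic attribution formula (equation 4.2):
# the change in predicted conversion probability when touchpoint w
# flips from 0 to 1.
import math

def logistic_attribution(beta0, beta1):
    p1 = 1 / (1 + math.exp(-(beta0 + beta1)))  # P(conv = 1 | w = 1)
    p0 = 1 / (1 + math.exp(-beta0))            # P(conv = 1 | w = 0)
    return p1 - p0

# Equation 4.1 is the same difference, but computed from the Bayesian
# network's conditional probability queries instead of a fitted sigmoid:
#   Attribution(w) = P(conversion = 1 | w = 1) - P(conversion = 1 | w = 0)

print(logistic_attribution(beta0=-1.0, beta1=0.5))
```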


5 Results

In this chapter, the results for the models described in chapter 4 are presented. The models using different white-list methods and algorithms are compared and benchmarked against the baseline logistic model. All models are trained on the dataset formed by combining the 'training' and 'validation' datasets and tested on the 'test' dataset, as mentioned in section 2.3.2. The distribution of attribution for the touchpoints/variables is also shown.

    5.1 Baseline Logistic Regression Model

    Table 5.1: Performance Metrics for Logistic Model

Model | Balanced Accuracy | F1 Score | Accuracy | Positive Class Accuracy
---|---|---|---|---
Logistic Model | 0.5974451 | 0.7067669 | 0.627685 | 0.5357143
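The metrics reported in these tables can be computed from a confusion matrix. A minimal sketch with illustrative counts (not the thesis data), assuming 'positive class accuracy' denotes sensitivity/recall on the conversion class:

```python
# Minimal sketch of the reported metrics from a confusion matrix.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)        # positive class accuracy (recall)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "balanced_accuracy": balanced_accuracy,
            "f1": f1, "positive_class_accuracy": sensitivity}

print(metrics(tp=60, fp=25, tn=45, fn=30))
```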

    5.2 Bayesian Network Models

Shown below are the performance metrics of the various models, sorted in decreasing order of balanced accuracy.

    Table 5.2: Performance of Models using different algorithms and white-list techniques

Model | Balanced Accuracy | F1 Score | Accuracy | Positive Class Accuracy | BIC
---|---|---|---|---|---
Hill climbing model, white-list variables using Chi-sq test | 0.6100052 | 0.7353207 | 0.650358 | 0.5822785 | -86189.53
MMHC model, white-list variables using Chi-sq test | 0.5995583 | 0.7264493 | 0.6396181 | 0.5625 | -86253.907
Tabu model, white-list variables using Chi-sq test | 0.5989452 | 0.7311828 | 0.6420048 | 0.5701754 | -86190.331
RSMAX2 model, white-list variables using Chi-sq test | 0.593064 | 0.7221719 | 0.6336516 | 0.5523013 | -86260.938
RSMAX2 model, white-list variables using Grid Search | 0.5920758 | 0.7210145 | 0.6324582 | 0.55 | -8530597.499
Hill climbing model, white-list variables using Grid Search | 0.5910877 | 0.7198549 | 0.6312649 | 0.5477178 | -8530522.464
MMHC model, white-list variables using Grid Search | 0.5805931 | 0.7131222 | 0.6217184 | 0.5313808 | -8530593.077
Tabu model, white-list variables using Grid Search | 0.570051 | 0.7086331 | 0.6133652 | 0.5172414 | -8530523.264
RSMAX2 model, white-list variables using MCA | 0.5116315 | 0.7356863 | 0.597852 | 0.4637681 | -20271.464
MMHC model, white-list variables using MCA | 0.505655 | 0.7315542 | 0.5918854 | 0.4285714 | -20287.556
Hill climbing model, white-list variables using MCA | 0.5051371 | 0.7319749 | 0.5918854 | 0.4264706 | -20220.175
Tabu model, white-list variables using MCA | 0.4986904 | 0.7264151 | 0.5847255 | 0.3888889 | -20220.975


Figure 5.1: Plot of the Bayesian network model with the highest accuracy, built using the hill-climbing algorithm with a white-list from the chi-square test

    5.3 Distribution of Attribution

From table 5.2, the model with the highest accuracy (as well as balanced accuracy) is chosen as the best-performing model and benchmarked against the baseline for attribution values. The distribution of attribution values for the variables is shown in figure 5.2. For the logistic regression model, the distribution of attribution values (figure 5.3) was calculated using the Hessian of the likelihood of the logistic regression (see section 3.6.2).
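The Hessian-based calculation can be sketched as follows. This is an illustrative Python sketch, assuming the standard asymptotic result that the inverse observed information approximates the covariance of the maximum-likelihood coefficients; the fitted values and design matrix are synthetic stand-ins.

```python
# Hedged sketch: a distribution of logistic attribution values from the
# Hessian of the log-likelihood. Coefficients are sampled from the
# asymptotic normal approximation and pushed through equation 4.2.
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative fitted values: intercept and one touchpoint coefficient.
beta_hat = np.array([-1.0, 0.5])
X = np.column_stack([np.ones(400), rng.integers(0, 2, 400)])

p = sigmoid(X @ beta_hat)
information = X.T @ (X * (p * (1 - p))[:, None])  # negative Hessian
cov = np.linalg.inv(information)                  # approx. cov of the MLE

draws = rng.multivariate_normal(beta_hat, cov, size=2000)
attributions = sigmoid(draws[:, 0] + draws[:, 1]) - sigmoid(draws[:, 0])
print("mean attribution:", attributions.mean().round(3))
```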


    Figure 5.2: Distribution of attribution values for variables from Bayesian Network Model


    Figure 5.3: Distribution of attribution values for variables from Logistic Model

    5.4 Comparison of Attribution

The mean values from the distributions of attribution values (see figures 5.2 and 5.3) are given below for both models.


Table 5.3: Mean attribution values of the variables for the Bayesian Network Model and the Logistic Model

Variable | Bayesian Network Model | Logistic Regression Model | BN Model Rank | Log Model Rank
---|---|---|---|---
there_was_a_seasonal_event_or_occasion | 0.31 | 0.04 | 1 | 12
instore_research | 0.27 | -0.07 | 2 | 23
i_saw_an_offer_promotion | 0.22 | 0.10 | 3 | 7
saw_a_sign_poster | 0.11 | 0.05 | 4 | 10
researched_on_search_engine | 0.10 | 0.13 | 5 | 5
friend_family_recommendation | 0.10 | 0.05 | 6 | 8
saw_a_product_display | 0.09 | 0.21 | 7 | 3
search_engine_ads | 0.07 | 0.16 | 8 | 4
i_saw_something_new | 0.07 | 0.11 | 9 | 6
online_video_ad | 0.06 | -0.05 | 10 | 22
promo_coupon_leaflet_not_from_retailer | 0.05 | 0.03 | 11 | 13
ads_on_radio_streaming | 0.04 | -0.03 | 12 | 19
ads_on_tv | 0.03 | 0.39 | 13 | 1
online_retailer_research | 0.02 | -0.02 | 14 | 18
brand_website | 0.01 | 0.02 | 15 | 16
outdoor_ads | 0.01 | 0.05 | 16 | 11
previous_shopping_list | 0.00 | -0.01 | 17 | 17
magazine_or_newspaper_ads | 0.00 | 0.05 | 18 | 9
social_media | 0.00 | -0.18 | 19 | 24
promo_coupon_leaflet_from_retailer | 0.00 | -0.03 | 20 | 21
online_retailer_visit | 0.00 | 0.02 | 21 | 15
recipe_site | 0.00 | 0.26 | 22 | 2
male | 0.00 | 0.02 | 23 | 14
banner_ads_online_not_social | -0.17 | -0.03 | 24 | 20


6 Discussion

    6.1 Results

From tables 5.1 and 5.2, one can see that the Bayesian network outperforms the logistic regression in predictive performance, by about 3% in terms of accuracy. Of the many algorithms that can be used to build the Bayesian network, the hill-climbing algorithm with the chi-square test for creating a white-list was found to perform best. As mentioned in [37], the predictive performance of these algorithms is principally data specific. However, these results still contradict the results obtained in [37] (where constraint-based algorithms were more accurate than score-based ones). Furthermore, one finds that using the brute-force grid search to find white-list variables yields the models with the highest BIC values; still, they do not necessarily predict better than the other models, further reinforcing the value of domain knowledge.

Comparing figures 5.2 and 5.3, it is evident that the attribution values for the Bayesian network have much lower variance than those for the logistic model. The low variance in attribution values could be due to the choice of model or to the formula used to calculate attribution (equation 4.1 vs. 4.2).

From table 5.3, one can see that the relative ranking of the variables by attribution value varies significantly between the two models. Assuming that the Bayesian network model is closer to the true model (owing to its higher accuracy), let us discuss why the top five variables (ranked by attribution value) for the logistic model do not reflect reality.

• ads_on_tv: has the highest variance in figure 5.3, and thus its attribution efficacy is unreliable. Furthermore, in figure 5.1, one can see that this is a confounding variable (it influences both the dependent variable and an independent variable [56]), which explains the low rank (13) under the Bayesian network.

• recipe_site: from the model summary in Appendix table A.1, we can see that it is not a significant variable considering the p-values. Hence, the very inclusion of this variable in the model can be questioned. In the Bayesian network (figure 5.1), it is evident that this variable is a terminal node, which explains its low rank (22).


• saw_a_product_display: is ranked 7 under the Bayesian network, suggesting mutual agreement between the two models. Furthermore, one can see that it has a low variance in figure 5.3, even though it is a confounding variable according to figure 5.1.

• search_engine_ads: is ranked 8 under the Bayesian network, suggesting mutual agreement between the two models. It also has a moderately high variance in figure 5.3, and it is an indirectly influencing variable (an ancestral node, connected to parents of the 'conv' node) according to figure 5.1. Furthermore, from the model summary in Appendix table A.1, one can see that it is not a significant variable considering the p-values.

• researched_on_search_engine: is ranked the same under the Bayesian network, suggesting perfect agreement between the two models. Furthermore, it has a moderately low variance in figure 5.3, and it is a directly influencing variable (a parent node of the 'conv' node) according to figure 5.1.

Due to the domain-specific data used in this project, the only authority able to validate the attribution values and their ranking would be the supervisors at Nepa. The following are their comments:

We find that the attribution numbers from the Bayesian model correspond well to our expectations. We usually see the attribution values of 'instore_research' being higher, as well as for 'i_saw_an_offer_promotion'. The touchpoint 'there_was_a_seasonal_event_or_occasion' is a rare touchpoint due to its very nature, but we think it makes sense that it has a high attribution value.

    6.2 Method

    In this section, the methodology followed throughout the project is discussed.

    6.2.1 Undersampling

One of the significant decisions in the project concerned data pre-processing, i.e., undersampling. Removing majority-class data may seem counter-intuitive to the 'more data is better' ethos of statistics and machine learning. The case for undersampling is discussed below.

In [17], the authors conclude that undersampling has been the de facto strategy for dealing with imbalanced data, with some caveats, such as increasing the variance of the classifier/model. In [58], the authors concluded that undersampling is the best strategy for an imbalanced dataset, provided 'balanced accuracy' is the primary optimization metric for the given model (as is the case in this project).

In [59], it was found that bagging and boosting did not improve performance compared to sampling methods, and it was also mentioned that over-sampling techniques can lead to over-fitting. Finally, in [60], it was concluded that over-sampling yielded no performance improvement compared to undersampling (for the C4.5 decision tree model).

Notwithstanding the conclusions above, there is a possibility that undersampling changes the distribution of the data, specifically the ratio of conversion to the other variables/touchpoints, thereby inflating the importance of a variable/touchpoint towards conversion. The result in section 2.3.3 shows that the proportion of conversion (conv == 0) has changed, albeit marginally. Thus, the relative importance of a channel/touchpoint may indeed not reflect reality.
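The undersampling step itself is mechanically simple. A minimal sketch, with an illustrative 9:1 imbalance rather than the thesis data:

```python
# Minimal sketch of random undersampling of the majority class to a
# 1:1 ratio. Labels and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(4)
conv = np.array([0] * 900 + [1] * 100)  # imbalanced labels, 9:1

minority_idx = np.flatnonzero(conv == 1)
majority_idx = np.flatnonzero(conv == 0)
# Keep all minority rows; sample an equal number of majority rows.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([minority_idx, kept_majority])

balanced = conv[balanced_idx]
print("class ratio after undersampling:", balanced.mean())
```

Because majority rows are dropped at random, the joint distribution of conversion with the touchpoints can shift, which is exactly the risk discussed above.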


Finally, regarding the choice of undersampling as the technique for tackling the imbalanced data: questioning this premise was outside the primary scope of this project, and undersampling was simply a means to an end. Furthermore, changing the technique used to counter the imbalance (such as bootstrapping) is unlikely to change the conclusion of this project, namely the viability of Bayesian networks for market attribution modeling.

    6.2.2 Bayesian Network Modeling

The baseline logistic regression model was tuned for the best possible balanced accuracy; the same was attempted for the Bayesian network model.

In section 4.2.4, 132 rules were provided as black-list rules while building the models. Three distinct techniques were employed for white-list creation, of which the chi-square test was deemed the best owing to its predictive performance, as is evident from table 5.2; however, the reason for this remains unclear.

The unimpressive predictive performance of the models using the Multiple Correspondence Analysis (MCA, see section 3.8.3) variables as white-list can be explained as follows: under MCA, the maximum variance is explained in a reduced-dimension space, often accompanied by a rotation of the data, while Bayesian networks do neither of these.

As mentioned in 4.2.3, the solution space for white-list variables contained about 33 million combinations, of which only a small set of 55 thousand was explored. Furthermore, the performance of the brute-force search was slightly worse than that of the chi-square test. Thus, our solution was a local optimum rather than the global optimum. However, given sufficient computing resources and some software optimisation, I firmly believe this issue could have been addressed.

The fact that, to overcome the performance of the baseline logistic model, one needs a substantial amount of domain knowledge in the form of white-lists and black-lists implies that, without domain knowledge, Bayesian networks are hard to use for market attribution. Furthermore, these models are not strictly "data-driven", as stated in [6], but rather a hybrid "data- and domain-driven" technique. Thus, having a source of domain knowledge becomes a prerequisite for optimal performance from the Bayesian network.

One of the upsides of using Bayesian networks is that one can compute not only the attributions of touchpoints but also answer 'what-if' scenarios (e.g., what is the probability that a male respondent goes shopping and uses coupons for discounts?) that may not even have occurred (a unique path of touchpoints). Furthermore, one does not need to know the underlying distributions of the variables to perform these queries. In my opinion, this further reinforces and validates the choice of Bayesian networks for attribution modeling.

    6.2.3 Attribution Formula

In section 4.4, different formulas were used for the logistic regression and Bayesian network models. The reason is that, in logistic regression, the coefficients obtained after model training are constant for a given dataset; thus equation 4.1 could not be used, since there is no variation in the coefficients. One solution would be to use bootstrapping to achieve the necessary variation, thereby enabling the use of equation 4.1 for logistic regression as well. However, this raises the question of why bootstrapping was not performed on the Bayesian network model.

For the Bayesian network, each sample of bootstrapped data would yield a different underlying structure. Each unique structure leads to different conclusions about the touchpoints, which may lead to conflicts, such as a touchpoint being termed directly influencing (a parent of, or confounding with, the 'conv' node) in one structure and indirectly influencing (an ancestral node of the 'conv' node) in another. Furthermore, having multiple graphical networks would be detrimental to the project objective and in violation of the 'Fairness' ("a good attribution system should reward an individual channel per its ability to affect the likelihood of conversion" [6]) and 'Interpretability' ("a good attribution system should be generally accepted by all parties with a material interest in the system, based on its statistical merit as well as on an intuitive understanding of the components of the system" [6]) properties of an attribution model specified by [6]. Thus, instead of having a disparity in techniques for attribution calculation, it was decided to have a disparity in the attribution formula instead.

    6.3 Ethical Considerations

During this thesis project, the data used was anonymous; thus, there is no risk of identifying the respondents. Furthermore, the respondents were paid by Nepa and are aware of exactly how their data was captured and used. The entire data process is GDPR compliant.

    6.4 Future Work

The data used in the project had a large number of respondents (30+) for every touchpoint, so tests like the chi-square test could be used. For small sample sizes, however, an independence test like Fisher's exact test is more appropriate. This could be a nice improvement to the project.
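The small-sample alternative can be sketched as follows; the cell counts are illustrative, chosen to be too small for a chi-square approximation.

```python
# Hedged sketch: Fisher's exact test as a small-sample alternative to the
# chi-square test for screening touchpoints against conversion.
from scipy.stats import fisher_exact

# 2x2 contingency table: rows = touchpoint seen (no/yes),
# columns = conversion (no/yes), with small cell counts.
table = [[12, 3],
         [5, 8]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p-value = {p_value:.4f}")
```

Unlike the chi-square test, Fisher's exact test computes the p-value exactly from the hypergeometric distribution, so it remains valid when expected cell counts are below 5.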

In [19], the authors conclude that combining over-sampling of the minority (abnormal) class with undersampling of the majority (typical) class performs better than undersampling alone. This is something that could be looked into as well.

    6.5 Delimitations

This project used one dataset, for a specific product and geography, and its results and insights are therefore limited. However, the project is an attempt to prove the viability of Bayesian networks as a good fit for market attribution modeling. The viability of Bayesian networks would be further reinforced by the use of multiple datasets; this, in my opinion, remains one of the most significant shortcomings of the entire project.

Furthermore, the use of the χ² test to select white-list variables, although it proved more effective than grid search in this project, is not without risks. The χ² test does not distinguish between associated variables and confounding variables; thus, it is recommended to use a more statistically rigorous test (e.g., partial correlations) to find these variables.


7 Conclusion

Can Bayesian networks be a good fit for market attribution modeling?

Looking at the distribution of attribution values and the comments of the supervisors at Nepa, one is inclined to conclude that Bayesian networks are a suitable choice of model for market attribution modeling. However, the fact that 132 black-list rules were needed for our Bayesian network to perform as well as it did suggests that domain knowledge remains quintessential for the creation of Bayesian networks.

Can Bayesian networks outperform the baseline model in terms of accuracy, considering that they emphasize 'interpretability'?

With the given dataset, the predictive performance of the chosen Bayesian network appears to be better than that of the baseline logistic regression model: approximately 3% higher accuracy and 2% higher balanced accuracy. From table 5.2, however, only 3 of the 12 Bayesian networks evaluated had higher accuracy than the baseline, which hints that these models require a threshold amount of tuning to achieve suitable accuracy.


Bibliography

[1] Xuhui Shao and Lexin Li. "Data-driven multi-touch attribution models". Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2011, pp. 258–264.

[2] Priyanka Rawal. "AIDA Marketing Communication Model: Stimulating a purchase decision in the minds of the consumers through a linear progression of steps". International Journal of Multidisciplinary Research in Social & Management Sciences 1.1 (2013), pp. 37–44.

[3] Mike T Bendixen. "Advertising effects and effectiveness". European Journal of Marketing 27.10 (1993), pp. 19–32.

[4] Peter C Verhoef, Scott A Neslin, and Björn Vroomen. "Multichannel customer management: Understanding the research-shopper phenomenon". International Journal of Research in Marketing 24.2 (2007), pp. 129–148.

[5] John Chandler-Pepelnjak. "Measuring ROI beyond the last ad". Atlas Institute Digital Marketing Insight (2009), pp. 1–6.

[6] Brian Dalessandro, Claudia Perlich, Ori Stitelman, and Foster Provost. "Causally motivated attribution for online advertising". Proceedings of the Sixth International Workshop on Data Mining for Online Advertising and Internet Economy. 2012, pp. 1–9.

[7] Hongshuang Li and PK Kannan. "Attributing conversions in a multichannel online marketing environment: An empirical model and a field experiment". Journal of Marketing Research 51.1 (2014), pp. 40–56.

[8] Lizhen Xu, Jason A Duan, and Andrew Whinston. "Path to purchase: A mutually exciting point process model for online advertising and conversion". Management Science 60.6 (2014), pp. 1392–1412.

[9] Vibhanshu Abhishek, Peter Fader, and Kartik Hosanagar. "Media exposure through the funnel: A model of multi-stage attribution". Available at SSRN 2158421 (2012).

[10] Pavel Kireyev, Koen Pauwels, and Sunil Gupta. "Do display ads influence search? Attribution and dynamics in online advertising"