ePubWU Institutional Repository

Thomas Rusch, Kurt Hornik, Wolfgang Janko, Ilro Lee and Achim Zeileis

Targeting Voters with Logistic Regression Trees. Conference or Workshop Item.

Original Citation:
Rusch, Thomas; Hornik, Kurt (ORCID: https://orcid.org/0000-0003-4198-9911); Janko, Wolfgang; Lee, Ilro; Zeileis, Achim (2011). Targeting Voters with Logistic Regression Trees. In: DAGM GfKl 2011, 30.08.-02.09., Frankfurt am Main, Deutschland.

This version is available at: https://epub.wu.ac.at/3748/
Available in ePubWU: January 2013

ePubWU, the institutional repository of the WU Vienna University of Economics and Business, is provided by the University Library and the IT-Services. Its aim is to enable open access to the scholarly output of the WU.

http://epub.wu.ac.at/
Targeting Voters With Logistic Regression Trees
SLIDE 1 DAGM GfKl 2011, 01-09-11
Outline
1 Motivation
2 Targeting Voters
3 Logistic Regression Trees
4 Illustration
5 Results
6 Discussion
This is joint work with Kurt Hornik, Wolfgang Jank, Ilro Lee and Achim Zeileis.
Political Campaigns
Political campaigning is a multi-million dollar business, and increasingly so (spending doubled in eight years)
For example, in the 2008 US presidential race all candidates together spent 1.6 billion USD; Barack Obama's 2008 campaign alone spent 513 million USD
A large portion of this money is spent on mobilizing voters
Obama received 69.5 million votes in 2008, which amounts to 7.39 USD spent per actual voter
Turnout in presidential elections has actually not changed much over the years
Hence the money spent per actual voter has increased substantially, while the effect on turnout was limited
Targeting Voters
The reason is that money is spent on people who would have turned out anyway (or would not have, regardless)
It is therefore economically imperative to identify the individuals worth targeting
To assess how likely each voter is to turn out, campaigns already use data and statistics, e.g.

Variables: individual historic voting records; age

Methods: ranking according to attended elections; linear regression on the relative frequency of attended elections; logistic regression; Chi-Square Automatic Interaction Detection (CHAID)

We propose a new approach to single out likely voters/non-voters in order to allocate resources more efficiently
Additionally, there are usually more variables available
Our Approach
Logistic Regression Trees (LORET) have a number of advantages for the task at hand
They are flexible and model possibly non-linear relationships
They can find high-order interactions
There exist non-parametric or parametric versions
They can be interpreted intuitively
They include the (sensible) usual methods as special cases
They can be turned into managerial decisions quite easily
Logistic Regression Tree Model
Let Y denote the response, and let Z and X denote covariate matrices of partitioning and linear predictor variables, respectively.

LORET build a partition-based relationship between Y and Z with r disjoint cells

In each partition B_k, k = 1, ..., r, we fit a logistic model with linear predictor Xβ:

P(Y_{i,k} = 1 | x_{i,k}) = p_{i,k} = exp(x_{i,k}^T β^{(k)}) / (1 + exp(x_{i,k}^T β^{(k)}))    (1)

Y_{i,k} (i = 1, ..., n_k) ... observation i in partition k
x_{i,k} ... predictor variable vector for observation i in partition k
β^{(k)} ... partition-specific parameter vector
p_{i,k} ... probability to vote of observation i in partition k
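As a minimal sketch of Equation (1) — with illustrative names and a hard partition assignment, not the authors' implementation — the partition-specific logistic prediction can be written in a few lines of NumPy:

```python
import numpy as np

def loret_predict(x, partition_id, betas):
    """Probability p_{i,k} that observation x in partition k votes,
    using the partition-specific coefficient vector beta^(k).
    All names here are illustrative, not from the original talk."""
    beta = betas[partition_id]                  # beta^(k)
    eta = x @ beta                              # linear predictor x^T beta^(k)
    return np.exp(eta) / (1.0 + np.exp(eta))    # Equation (1)

# Two partitions (cells B_1, B_2) with different coefficient vectors
betas = {0: np.array([0.5, -1.0]), 1: np.array([-0.2, 2.0])}
p = loret_predict(np.array([1.0, 1.0]), 1, betas)  # eta = 1.8, p ≈ 0.858
```

In a fitted LORET, the partition id would come from dropping the observation's z-vector down the tree; here it is supplied directly to keep the sketch short.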
Special Instances Of LORET
LORET subsumes a number of well-known methods:
Given the partition, classification trees specify an intercept-only logistic model in the nodes
Logistic models are the special case of a single partition containing all observations, with a logistic model in that one node
Majority vote is an intercept-only model in a single partition containing all observations
A functional tree is a hybrid of a logistic model and partitioning
Method                      Predictor Var.   Partitioning Var.   Schema
Majority Vote               none             none                y ∼ 1
Logistic Regression         X                none                y ∼ X
Binary Classification Tree  none             Z                   y ∼ 1|Z
Model-based Tree            X                Z                   y ∼ X|Z
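Three of the four schemas can be reproduced with off-the-shelf tools; a small sketch with toy data and scikit-learn (names and data purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))        # linear predictor variables
Z = rng.normal(size=(200, 3))        # partitioning variables
y = (X[:, 0] + 0.5 * Z[:, 0] > 0).astype(int)

# y ~ 1   : majority vote (intercept-only model, single partition)
majority_class = int(y.mean() >= 0.5)

# y ~ X   : one logistic model over all observations
logit = LogisticRegression().fit(X, y)

# y ~ 1|Z : classification tree, i.e. an intercept-only model per leaf
tree = DecisionTreeClassifier(max_depth=3).fit(Z, y)
```

The fourth schema, y ∼ X|Z, needs a model-based tree implementation such as mob() in the R packages party/partykit.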
Estimation
These similarities let us estimate LORET with well-known algorithms
Logistic regression models: Iteratively Reweighted Least Squares
Classification Trees: CART, Ctree, C4.5, C5
Hybrid models: MOB, LOTUS, GUIDE, SUPPORT
In what follows we look only at a subset of these methods (highlighted in blue on the original slide).
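The logistic-model part is fitted with Iteratively Reweighted Least Squares, which takes only a handful of lines; a self-contained sketch of the standard GLM textbook algorithm (not the authors' code):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Fit logistic regression coefficients via Iteratively
    Reweighted Least Squares (Fisher scoring for the logit link)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))            # fitted probabilities
        w = np.clip(mu * (1.0 - mu), 1e-10, None)  # IRLS weights
        z = eta + (y - mu) / w                     # working response
        # weighted least-squares step: beta = (X^T W X)^{-1} X^T W z
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

# Recover known coefficients (0.5, -1.5) from simulated data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
p = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, -1.5]))))
y = (rng.random(500) < p).astype(float)
beta_hat = irls_logistic(X, y)
```

With 500 observations the estimates land close to the true coefficients; the clipping of the weights only guards against fitted probabilities of exactly 0 or 1.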
Illustration - Data Description
We will focus on a Get-Out-The-Vote (GOTV) application
GOTV campaigns try to mobilize eligible voters to participate in an election
We have n = 19624 eligible voters from Ohio
Explanatory variables (p = 84):
Voting history (1990-2004), relative frequency of attended elections
Demographic variables
Party affiliation variables
We assess the performance of different LORET
We compare the usage of the usual variables with that of additional variables
Illustration - Benchmark Study Design
We divided all variables into two sets: age and the last four years of voting history (simple), and all other variables (extended)
Both the simple set and the combined simple and extended set were used with logistic regression and with classification trees
For model-based trees we used the simple set in the logistic model part and the extended set for the partitioning part
To assess prediction performance we used a bootstrap sampling approach for training and test data (10 samples)
The methods were compared via accuracy and (the area under) ROC curves on the out-of-bag samples
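The evaluation scheme above can be sketched as follows — illustrative names, any scikit-learn classifier with predict_proba, not the authors' code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

def oob_performance(model, X, y, n_boot=10, seed=0):
    """Mean out-of-bag accuracy (cutoff 0.5) and AUC over
    n_boot bootstrap training samples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    accs, aucs = [], []
    for _ in range(n_boot):
        train = rng.integers(0, n, size=n)        # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), train)   # out-of-bag test indices
        model.fit(X[train], y[train])
        prob = model.predict_proba(X[oob])[:, 1]
        accs.append(accuracy_score(y[oob], (prob >= 0.5).astype(int)))
        aucs.append(roc_auc_score(y[oob], prob))
    return float(np.mean(accs)), float(np.mean(aucs))

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)
acc, auc = oob_performance(LogisticRegression(), X, y)
```

Each bootstrap sample leaves roughly a third of the observations out of bag, which serve as the test set for that replication.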
Results - Accuracies At 0.5
[Figure: Boxplots of accuracies at the 0.5 cutoff over all 10 bootstrap samples for MV, s. LR, s. RP, s. CT, LR, RP, CT and MOB; accuracies range from about 0.70 to 0.85 (simple data set in the middle, extended and simple data set to the right).]
Results - Table
Method          Mean Acc.   SD Acc.   Mean AUC   Param./Segment   Splits   Segments
Majority Vote   .704        .004      .500       1                1        1
s. LogReg       .750        .003      .738       8                1        1
s. Ctree        .759        .004      .765       1                14       15
s. CART         .760        .005      .745       1                27.5     28.5
e. LogReg       .847        .003      .885       56               1        1
e. Ctree        .858        .003      .898       1                17       18
e. CART         .860        .004      .878       1                22.5     23.5
MOB             .855        .004      .903       8                9        10

Table: Mean and sd of accuracies, mean AUC, and median number of parameters per segment, splits, and segments over all 10 bootstrap samples for the combined extended and simple data set.
Results - Mean Accuracies
[Figure: Mean accuracy across the full range of possible cutoff values (0.0-1.0) for the extended and simple data sets; curves shown for Majority Vote, s. CART, s. Conditional Inference Tree, s. Logistic Regression, CART, Conditional Inference Tree, Model-Based Tree, and Logistic Regression.]
Results - ROC Curves
[Figure: Averaged ROC curves (average true positive rate vs. average false positive rate) for Majority Vote, s. CART, s. Conditional Inference Tree, s. Logistic Regression, CART, Conditional Inference Tree, Model-Based Tree, and Logistic Regression with the extended and simple data sets.]
Results - Mean Accuracy Difference
[Figure: Simultaneous 95% confidence intervals for the mean accuracy difference at the 0.5 cutoff, for the pairwise comparisons MOB − CT, MOB − RP, CT − RP, MOB − LR, CT − LR, and RP − LR; differences range from about −0.005 to 0.015.]
Results - Mean AUC Difference
[Figure: Simultaneous 95% confidence intervals for the mean AUC difference, for the pairwise comparisons MOB − CT, MOB − RP, CT − RP, MOB − LR, CT − LR, and RP − LR; differences range from about −0.01 to 0.03.]
Discussion - I
Variable sets
Using only age and the last four comparable elections will not get us far
It is much better to include the additional variables
One can gain up to 10% in classification accuracy
Methods
The tree-based methods significantly outperform the other methods in accuracy and AUC
CART and Ctree have the highest accuracy at the 0.5 cutoff; MOB has the highest AUC
Tree-based methods are more reliable than logistic regression and majority vote
Discussion - II
Intelligibility/Complexity
Intercept-only tree-based methods have a high median number of splits; logistic regression has a high number of estimated parameters
MOB divides the complexity over two levels: splits and the logistic model part
MOB can be more parsimonious in both splits and the logistic model, and is thus more intelligible
Managerial Decision
Of all tree methods (CART, Ctree, MOB), MOB found the highest number of people in the usual 0.3 to 0.6 targeting range
If all the targeted people voted, targeting with MOB could have increased turnout by 7.5%
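The targeting rule itself is just a filter on the predicted turnout probabilities; a sketch, with illustrative names, using the 0.3-0.6 band mentioned above:

```python
import numpy as np

def targeting_set(turnout_prob, lower=0.3, upper=0.6):
    """Indices of 'persuadable' voters: those who are neither
    near-certain non-voters (p < lower) nor near-certain voters
    (p > upper), and hence worth spending campaign resources on."""
    p = np.asarray(turnout_prob)
    return np.flatnonzero((p >= lower) & (p <= upper))

# Example: only the second and third person fall into the band
targets = targeting_set([0.05, 0.35, 0.50, 0.62, 0.90])
```

In practice the probabilities would come from the fitted LORET, and the band's endpoints are a campaign-specific choice.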
Conclusion
In voter targeting, a limited set of explanatory variables as well as suboptimal methods are used
We tried to improve voter targeting approaches by proposing a framework of Logistic Regression Trees (LORET) and by using more variables
A bootstrap sampling study on a real data set from Ohio was used to assess the performance
We found that including more variables was better
We found that tree approaches outperformed the logistic model
Hence we suggest using LORET for voter targeting
A good compromise between classification accuracy, complexity, intelligibility and managerial decision making seems to be a LORET of the form y ∼ X|Z, where X is the limited set of variables
References
Malchow, H. (2008). Political Targeting. Predicted Lists, Ltd., 2nd edition.
Hothorn, T., Hornik, K. & Zeileis, A. (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. JCGS, 15, 651-674.
Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984). Classification and Regression Trees. Wadsworth.
Zeileis, A., Hothorn, T. & Hornik, K. (2008). Model-Based Recursive Partitioning. JCGS, 17, 492-514.
McCullagh, P. & Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, 2nd edition.
Thank You for Your Attention
Thomas Rusch
Department of Finance, Accounting and Statistics
Institute for Statistics and Mathematics
email: [email protected]
URL: http://statmath.wu.ac.at/~tr

WU Wirtschaftsuniversität Wien
Augasse 2–6, A-1090 Wien