ePubWU Institutional Repository

Thomas Rusch, Kurt Hornik, Wolfgang Janko, Ilro Lee and Achim Zeileis

Targeting Voters with Logistic Regression Trees. Conference or Workshop Item.

Original Citation:
Rusch, Thomas; Hornik, Kurt (ORCID: https://orcid.org/0000-0003-4198-9911); Janko, Wolfgang; Lee, Ilro; Zeileis, Achim (2011). Targeting Voters with Logistic Regression Trees. In: DAGM GfKl 2011, 30.08.-02.09., Frankfurt am Main, Deutschland.

This version is available at: https://epub.wu.ac.at/3748/
Available in ePubWU: January 2013

ePubWU, the institutional repository of the WU Vienna University of Economics and Business, is provided by the University Library and the IT-Services. Its aim is to enable open access to the scholarly output of the WU.

http://epub.wu.ac.at/
Targeting Voters With Logistic Regression Trees
SLIDE 1 DAGM GfKl 2011, 01-09-11
Outline
1 Motivation
2 Targeting Voters
3 Logistic Regression Trees
4 Illustration
5 Results
6 Discussion
This is joint work with Kurt Hornik, Wolfgang Jank, Ilro Lee and Achim Zeileis.
Political Campaigns
Political campaigning is a multi-million dollar business, and increasingly so (spending doubled in eight years)
For example, in the 2008 US presidential race all candidates together spent 1.6 billion USD; Barack Obama's 2008 campaign alone spent 513 million USD
A large portion of this money is spent on mobilizing voters
Obama received 69.5 million votes in 2008, which amounts to 7.39 USD spent per actual voter
Turnout in presidential elections has actually not changed much over the years
Hence the money spent per actual voter has increased substantially, while the effect on turnout was limited
Targeting Voters
The reason is that money is spent on people who would have turned out anyway (or would not have, regardless)
It is therefore economically imperative to identify the individuals worth targeting
To assess how likely each voter is to turn out, campaigns already use data and statistics, e.g.

Variables: individual historic voting records; age

Methods: ranking according to attended elections; linear regression on the relative frequency of attended elections; logistic regression; Chi-Square Automatic Interaction Detection (CHAID)

We propose a new approach to single out likely voters/non-voters in order to allocate resources more efficiently
Additionally, there are usually more variables available
Our Approach
Logistic Regression Trees (LORET) have a number of advantages for the task at hand
They are flexible and model possibly non-linear relationships
They can find high-order interactions
There exist non-parametric or parametric versions
They can be interpreted intuitively
They include the (sensible) usual methods as special cases
They can be turned into managerial decisions quite easily
Logistic Regression Tree Model
Let Y denote the response, and let Z and X denote covariate matrices of partitioning and linear predictor variables, respectively.

LORET build a partition-based relationship between Y and Z with r disjoint cells

In each partition B_k, k = 1, ..., r, we fit a logistic model with linear predictor Xβ:

P(Y_{i,k} = 1 | x_{i,k}) = p_{i,k} = exp(x_{i,k}^T β^{(k)}) / (1 + exp(x_{i,k}^T β^{(k)}))    (1)

Y_{i,k} (i = 1, ..., n_k) ... observation i in partition k
x_{i,k} ... predictor variable vector for observation i in partition k
β^{(k)} ... partition-specific parameter vector
p_{i,k} ... probability to vote of observation i in partition k
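As a minimal sketch of Equation (1) — with illustrative names and a hard partition assignment, not the authors' implementation — the partition-specific logistic prediction can be written in a few lines of NumPy:

```python
import numpy as np

def loret_predict(x, partition_id, betas):
    """Probability p_{i,k} that observation x in partition k votes,
    using the partition-specific coefficient vector beta^(k).
    All names here are illustrative, not from the original talk."""
    beta = betas[partition_id]                  # beta^(k)
    eta = x @ beta                              # linear predictor x^T beta^(k)
    return np.exp(eta) / (1.0 + np.exp(eta))    # Equation (1)

# Two partitions (cells B_1, B_2) with different coefficient vectors
betas = {0: np.array([0.5, -1.0]), 1: np.array([-0.2, 2.0])}
p = loret_predict(np.array([1.0, 1.0]), 1, betas)  # eta = 1.8, p ≈ 0.858
```

In a fitted LORET, the partition id would come from dropping the observation's z-vector down the tree; here it is supplied directly to keep the sketch short.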
Special Instances Of LORET
LORET subsumes a number of well-known methods:
Given the partition, classification trees specify an intercept-only logistic model in the nodes
Logistic models are the special case of a single partition containing all observations, with a logistic model in that one node
Majority vote is an intercept-only model in a single partition containing all observations
A functional tree is a hybrid of a logistic model and partitioning
Method                      Predictor Var.   Partitioning Var.   Schema
Majority Vote               none             none                y ∼ 1
Logistic Regression         X                none                y ∼ X
Binary Classification Tree  none             Z                   y ∼ 1|Z
Model-based Tree            X                Z                   y ∼ X|Z
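Three of the four schemas can be reproduced with off-the-shelf tools; a small sketch with toy data and scikit-learn (names and data purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))        # linear predictor variables
Z = rng.normal(size=(200, 3))        # partitioning variables
y = (X[:, 0] + 0.5 * Z[:, 0] > 0).astype(int)

# y ~ 1   : majority vote (intercept-only model, single partition)
majority_class = int(y.mean() >= 0.5)

# y ~ X   : one logistic model over all observations
logit = LogisticRegression().fit(X, y)

# y ~ 1|Z : classification tree, i.e. an intercept-only model per leaf
tree = DecisionTreeClassifier(max_depth=3).fit(Z, y)
```

The fourth schema, y ∼ X|Z, needs a model-based tree implementation such as mob() in the R packages party/partykit.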
Estimation
These similarities let us estimate LORET with well-known algorithms
Logistic regression models: Iteratively Reweighted Least Squares
Classification Trees: CART, Ctree, C4.5, C5
Hybrid models: MOB, LOTUS, GUIDE, SUPPORT
In what follows we look only at a subset of these methods (highlighted in blue on the original slide).
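The logistic-model part is fitted with Iteratively Reweighted Least Squares, which takes only a handful of lines; a self-contained sketch of the standard GLM textbook algorithm (not the authors' code):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Fit logistic regression coefficients via Iteratively
    Reweighted Least Squares (Fisher scoring for the logit link)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))            # fitted probabilities
        w = np.clip(mu * (1.0 - mu), 1e-10, None)  # IRLS weights
        z = eta + (y - mu) / w                     # working response
        # weighted least-squares step: beta = (X^T W X)^{-1} X^T W z
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

# Recover known coefficients (0.5, -1.5) from simulated data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
p = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, -1.5]))))
y = (rng.random(500) < p).astype(float)
beta_hat = irls_logistic(X, y)
```

With 500 observations the estimates land close to the true coefficients; the clipping of the weights only guards against fitted probabilities of exactly 0 or 1.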
Illustration - Data Description
We will focus on a Get-Out-The-Vote (GOTV) application
GOTV campaigns try to mobilize eligible voters to participate in an election
We have n = 19624 eligible voters from Ohio
Explanatory variables (p = 84):
Voting history (1990-2004), relative frequency of attended elections
Demographic variables
Party affiliation variables
We assess the performance of different LORET
We compare the usage of the usual variables with that of additional variables
Illustration - Benchmark Study Design
We divided all variables into two sets: age and the last four years of voting history (simple), and all other variables (extended)
Both the simple set and the combined simple and extended set were used with logistic regression and with classification trees
For model-based trees we used the simple set in the logistic model part and the extended set for the partitioning part
To assess prediction performance we used a bootstrap sampling approach for training and test data (10 samples)
The methods were compared via accuracy and (the area under) ROC curves on the out-of-bag samples
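The evaluation scheme above can be sketched as follows — illustrative names, any scikit-learn classifier with predict_proba, not the authors' code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

def oob_performance(model, X, y, n_boot=10, seed=0):
    """Mean out-of-bag accuracy (cutoff 0.5) and AUC over
    n_boot bootstrap training samples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    accs, aucs = [], []
    for _ in range(n_boot):
        train = rng.integers(0, n, size=n)        # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), train)   # out-of-bag test indices
        model.fit(X[train], y[train])
        prob = model.predict_proba(X[oob])[:, 1]
        accs.append(accuracy_score(y[oob], (prob >= 0.5).astype(int)))
        aucs.append(roc_auc_score(y[oob], prob))
    return float(np.mean(accs)), float(np.mean(aucs))

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)
acc, auc = oob_performance(LogisticRegression(), X, y)
```

Each bootstrap sample leaves roughly a third of the observations out of bag, which serve as the test set for that replication.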
Results - Accuracies At 0.5
[Figure: Boxplots of accuracies at the 0.5 cutoff over all 10 bootstrap samples for MV, s. LR, s. RP, s. CT, LR, RP, CT and MOB; accuracies range from about 0.70 to 0.85 (simple data set in the middle, extended and simple data set to the right).]
Results - Table
Method          Mean Acc.   SD Acc.   Mean AUC   Param./Segment   Splits   Segments
Majority Vote   .704        .004      .500       1                1        1
s. LogReg       .750        .003      .738       8                1        1
s. Ctree        .759        .004      .765       1                14       15
s. CART         .760        .005      .745       1                27.5     28.5
e. LogReg       .847        .003      .885       56               1        1
e. Ctree        .858        .003      .898       1                17       18
e. CART         .860        .004      .878       1                22.5     23.5
MOB             .855        .004      .903       8                9        10

Table: Mean and sd of accuracies, mean AUC, and median number of parameters per segment, splits, and segments over all 10 bootstrap samples for the combined extended and simple data set.
Results - Mean Accuracies
[Figure: Mean accuracy across the full range of possible cutoff values (0.0-1.0) for the extended and simple data sets; curves shown for Majority Vote, s. CART, s. Conditional Inference Tree, s. Logistic Regression, CART, Conditional Inference Tree, Model-Based Tree, and Logistic Regression.]
Results - ROC Curves
[Figure: Averaged ROC curves (average true positive rate vs. average false positive rate) for Majority Vote, s. CART, s. Conditional Inference Tree, s. Logistic Regression, CART, Conditional Inference Tree, Model-Based Tree, and Logistic Regression with the extended and simple data sets.]
Results - Mean Accuracy Difference
[Figure: Simultaneous 95% confidence intervals for the mean accuracy difference at the 0.5 cutoff, for the pairwise comparisons MOB − CT, MOB − RP, CT − RP, MOB − LR, CT − LR, and RP − LR; differences range from about −0.005 to 0.015.]
Results - Mean AUC Difference
[Figure: Simultaneous 95% confidence intervals for the mean AUC difference, for the pairwise comparisons MOB − CT, MOB − RP, CT − RP, MOB − LR, CT − LR, and RP − LR; differences range from about −0.01 to 0.03.]
Discussion - I
Variable sets
Using only age and the last four comparable elections will not get us far
It is much better to include the additional variables
One can gain up to 10% in classification accuracy
Methods
The tree-based methods significantly outperform the other methods in accuracy and AUC
CART and Ctree have the highest accuracy at the 0.5 cutoff; MOB has the highest AUC
Tree-based methods are more reliable than logistic regression and majority vote
Discussion - II
Intelligibility/Complexity
Intercept-only tree-based methods have a high median number of splits; logistic regression has a high number of estimated parameters
MOB divides the complexity over two levels: splits and the logistic model part
MOB can be more parsimonious in both splits and the logistic model, and is thus more intelligible
Managerial Decision
Of all tree methods (CART, Ctree, MOB), MOB found the highest number of people in the usual 0.3 to 0.6 targeting range
If all the targeted people voted, targeting with MOB could have increased turnout by 7.5%
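The targeting rule itself is just a filter on the predicted turnout probabilities; a sketch, with illustrative names, using the 0.3-0.6 band mentioned above:

```python
import numpy as np

def targeting_set(turnout_prob, lower=0.3, upper=0.6):
    """Indices of 'persuadable' voters: those who are neither
    near-certain non-voters (p < lower) nor near-certain voters
    (p > upper), and hence worth spending campaign resources on."""
    p = np.asarray(turnout_prob)
    return np.flatnonzero((p >= lower) & (p <= upper))

# Example: only the second and third person fall into the band
targets = targeting_set([0.05, 0.35, 0.50, 0.62, 0.90])
```

In practice the probabilities would come from the fitted LORET, and the band's endpoints are a campaign-specific choice.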
Conclusion
In voter targeting, a limited set of explanatory variables as well as suboptimal methods are used
We tried to improve voter targeting approaches by proposing a framework of Logistic Regression Trees (LORET) and by using more variables
A bootstrap sampling study on a real data set from Ohio was used to assess the performance
We found that including more variables was better
We found that tree approaches outperformed the logistic model
Hence we suggest using LORET for voter targeting
A good compromise between classification accuracy, complexity, intelligibility and managerial decision making seems to be a LORET of the form y ∼ X|Z, where X is the limited set of variables
References
Malchow, H. (2008). Political Targeting. Predicted Lists, Ltd., 2nd edition.
Hothorn, T., Hornik, K. & Zeileis, A. (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. JCGS, 15, 651-674.
Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984). Classification and Regression Trees. Wadsworth.
Zeileis, A., Hothorn, T. & Hornik, K. (2008). Model-Based Recursive Partitioning. JCGS, 17, 492-514.
McCullagh, P. & Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, 2nd edition.
Thank You for Your Attention
Thomas Rusch
Department of Finance, Accounting and Statistics
Institute for Statistics and Mathematics
email: [email protected]
URL: http://statmath.wu.ac.at/~tr

WU Wirtschaftsuniversität Wien
Augasse 2–6, A-1090 Wien