Demographics and Weblog Targeting

Copyright, all rights reserved. Doug Chang, dougc at stanfordalumni dot org

Demographics and Weblog Hackathon – Case Study

5.3% of Motley Fool visitors are subscribers. Design a classification model for insight into which variables are important for strategies to increase the subscription rate.

Learn by Doing

http://www.meetup.com/HandsOnProgrammingEvents/

Data Mining Hackathon

Funded by Rapleaf

• With Motley Fool's data
• App note for Rapleaf/Motley Fool
• Template for other hackathons
• Did not use AWS; R on individual PCs
• Logistics: Rapleaf funded prizes and food for two weekends for ~20-50 people. The venue was free

Getting more subscribers

Headline Data, Weblog

Demographics

Cleaning Data

• training.csv (201,000), headlines.tsv (811 MB), entry.tsv (100k), demographics.tsv (loading sketched below)
• Feature Engineering
• Github:
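
A minimal sketch of loading and joining these files in R before any modeling. The join key uid and the file layouts are assumptions; the actual schemas aren't shown in the slides.

# read.delim handles the tab-separated .tsv files.
train <- read.csv("training.csv")         # labels plus weblog-derived features
demo  <- read.delim("demographics.tsv")   # sparse demographic fields
entry <- read.delim("entry.tsv")          # weblog entries
# Left-join demographics onto the training rows by a hypothetical user id.
dat <- merge(train, demo, by = "uid", all.x = TRUE)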

Ensemble Methods

• Bagging, boosting, random forests
• Overfitting
• Stability (small changes in the data make large changes in predictions)
• Historically, none of these worked at scale
• Small-scale results here use R; large-scale versions exist only in proprietary implementations (Google, Amazon, etc.). A small-scale sketch follows below
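
As one example of the listed methods, a small-scale random forest in R via the randomForest package. The data frame dat and its subscriber label column carry over from the loading sketch above and are assumptions.

library(randomForest)
set.seed(1)
idx <- sample(nrow(dat), floor(0.8 * nrow(dat)))   # simple 80/20 holdout split
rf <- randomForest(factor(subscriber) ~ ., data = dat[idx, ],
                   ntree = 500, importance = TRUE,
                   na.action = na.roughfix)        # demographics are sparse; rough imputation
varImpPlot(rf)   # quick look at which variables the forest relies on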

ROC Curves

Binary Classifier Only!
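
A sketch of drawing such a curve with the pROC package, scoring the holdout rows from the random-forest sketch above; the test split and subscriber column are assumptions.

library(pROC)
test <- dat[-idx, ]                                      # holdout rows
p <- predict(rf, newdata = test, type = "prob")[, 2]     # P(subscriber)
roc_obj <- roc(test$subscriber, p)
plot(roc_obj)    # the ROC curve; only defined for a binary classifier
auc(roc_obj)     # area under the curve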

Paid Subscriber ROC curve, ~61%

Boosted Regression Trees Performance

• Training-data ROC score = 0.745 (the fitting call is sketched below)
• CV ROC score = 0.737; se = 0.002
• 5.5% below the winning score, without doing any data processing
• Random is 50% (0.50); 0.737 - 0.50 puts us 23.7 percentage points better than random
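
The tc/lr/#trees vocabulary on the Conclusion slide, and the cv ROC score with a standard error, match dismo::gbm.step (Elith and Leathwick's boosted regression trees wrapper around gbm). A sketch under that assumption; the column indices are illustrative.

library(dismo)   # provides gbm.step on top of the gbm package
set.seed(1)
brt <- gbm.step(data = dat,
                gbm.x = 2:12,             # predictor columns (illustrative indices)
                gbm.y = 1,                # response column: subscriber (assumed)
                family = "bernoulli",     # binary outcome
                tree.complexity = 5,      # tc: depth of variable interactions
                learning.rate = 0.01,     # lr: shrinkage per tree
                bag.fraction = 0.5)
# gbm.step prints training and cross-validated ROC scores (with se) as it adds trees.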

Contribution of predictor variables

Predictive Importance

• Friedman: the number of times a variable is selected for splitting, weighted by the squared improvement to the model. A measure of sparsity in the data
• Fit plots remove the averages of the model variables

Relative influence (sums to ~100):

1 pageV 74.0567852
2 loc 11.0801383
3 income 4.1565597
4 age 3.1426519
5 residlen 3.0813927
6 home 2.3308287
7 marital 0.6560258
8 sex 0.6476549
9 prop 0.3817017
10 child 0.2632598
11 own 0.2030012

The calls that produce this ranking and the fit plots are sketched below.
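With a dismo/gbm fit like the brt object sketched earlier (an assumption), both outputs come from two calls:

# Friedman's relative influence: split counts weighted by squared improvement.
summary(brt)
# Partial-dependence ("fit") plots: each variable's effect with the other
# variables averaged out of the model.
gbm.plot(brt, n.plots = 11, write.title = FALSE)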

Behavioral vs. Demographics

• Demographics are sparse
• Behavioral weblogs are the best source. Most sites aren't using this information correctly. There is no single correct answer; finding features is trial and error. The features are more important than the algorithm
• Linear vs. nonlinear

Fitted Values (Crappy)

Fitted Values (Better)

Predictor Variable Interaction

• Adjusting variable interactions

Variable Interactions

Plot Interactions: age, loc
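
In the same dismo workflow, pairwise interactions can be ranked and plotted; a sketch assuming the brt fit from earlier, with illustrative predictor indices standing in for age and loc.

# Rank pairwise interactions between predictors by their size.
ints <- gbm.interactions(brt)
ints$rank.list
# 3-D perspective plot of the joint effect of two predictors, e.g. age vs. loc.
gbm.perspec(brt, x = 4, y = 2)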

Trees vs. other methods

• You can see multiple levels, which is good for trees. Do other variables match this? Simplify the model or add more features; iterate to a better model
• No math needed; this is analyst work

Number of Trees

Data Set: Number of Trees
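
gbm.step grows trees in stages and picks the count by cross-validation itself; with plain gbm the same deviance-vs-trees curve can be drawn with gbm.perf. A sketch, reusing the assumed dat frame and subscriber column.

library(gbm)
fit <- gbm(subscriber ~ ., data = dat, distribution = "bernoulli",
           n.trees = 3000, shrinkage = 0.01, interaction.depth = 5,
           cv.folds = 5)
# Plots holdout deviance against the number of trees and returns the optimum.
best.iter <- gbm.perf(fit, method = "cv")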

Hackathon Results

Weblogs only: 68.15%, about 18 percentage points better than random

Demographics add 1%

AWS Advantages

• Running multiple instances with different algorithms and parameters using R (a single-machine sweep is sketched below)
• To do: add a tutorial, install Screen, work around R GUI bugs
• http://amazonlabs.pbworks.com/w/page/28036646/FrontPage
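
Even without AWS, a sweep over tc and lr can at least use local cores via R's parallel package; a sketch, with the grid values purely illustrative (mc.cores > 1 is Unix-only).

library(parallel)
library(dismo)
grid <- expand.grid(tc = c(3, 5, 7), lr = c(0.1, 0.01, 0.005))
# Fit one boosted model per (tc, lr) pair on separate cores.
fits <- mclapply(seq_len(nrow(grid)), function(i)
  gbm.step(data = dat, gbm.x = 2:12, gbm.y = 1, family = "bernoulli",
           tree.complexity = grid$tc[i], learning.rate = grid$lr[i]),
  mc.cores = 4)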

Conclusion

• Data mining at scale requires more development in visualization, MR algorithms, and MR data preprocessing
• Tuning using visualization. There are three parameters to tune: tree complexity (tc), learning rate (lr), and the number of trees; two of the three were not covered here
• This isn't reproducible in Hadoop/Mahout or any open-source code I know of
• Other use cases: predicting which item will sell (eBay), search engine ranking
• Be careful with MR paradigms: Hadoop MR != Couchbase MR