Games and Big Data: A Scalable Multi-Dimensional Churn Prediction Model
Paul Bertens, Anna Guitart and África Periáñez (Silicon Studio)
CIG 2017 New York 23rd August 2017
Who are we?
Game studio and graphics middleware company based in Tokyo (spin-off of Silicon Graphics)
YOKOZUNA data: Research unit of Game Data Science providing individual player predictions to game studios
Goals: predict player behavior, scale to big data and provide an intuitive result visualization
Churn prediction in Free-To-Play games
1) Rothenbuehler J. et al., 2015. Hidden Markov models for churn prediction.
2) Periáñez A. et al., 2016. Churn prediction in mobile social games: towards a complete assessment using survival ensembles.
This is all about churn… When is a player going to exit the game?
→ In terms of days (refs. 1, 2), level, or hours played
● We focus on the top spenders (whales)
○ less than 2% of the players, 50% of the revenue
● General model that adapts to diverse games and datasets
○ the definition of churn in F2P games is not straightforward
○ we define churn as 10 days of inactivity (players who come back after that account for only 1% of the revenue)
● Parallelizable algorithm
○ applied in a production environment
○ scalable to Big Data, up to tens of millions of MAU
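The 10-day inactivity rule can be turned into survival-style labels. The following is a minimal sketch, not the authors' pipeline: the function name, the three date arguments and the example dates are assumptions made for illustration.

```python
from datetime import date

CHURN_GAP_DAYS = 10  # inactivity threshold stated in the talk


def label_player(first_login, last_login, extraction_date):
    """Return (lifetime_days, churned) for one player.

    churned is True when the player has been inactive for at least
    CHURN_GAP_DAYS at extraction time. Otherwise the player is censored:
    we only know the lifetime is at least lifetime_days.
    """
    lifetime_days = (last_login - first_login).days
    inactive_days = (extraction_date - last_login).days
    return lifetime_days, inactive_days >= CHURN_GAP_DAYS


# Hypothetical extraction date and players
extraction = date(2017, 8, 23)
print(label_player(date(2017, 5, 1), date(2017, 7, 1), extraction))   # (61, True)
print(label_player(date(2017, 5, 1), date(2017, 8, 20), extraction))  # (111, False)
```

The second player is still active, so the record is kept as a censored observation rather than discarded.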
The model: Survival Ensembles
Challenge: modeling churn
Survival analysis focuses on predicting the time-to-event, e.g. churn
➔ Survival analysis is used in biology and medicine to deal with this problem
➔ Ensemble learning techniques provide high-quality prediction results
● Classical methods, like regressions, are appropriate when all players have left the game
● Censoring problem: dataset with incomplete churning information
● Censoring is the nature of churn
○ when will a player stop playing?
● Two approaches:
○ Churn as a binary classification
○ Churn as a censored data problem
➔ Survival analysis methods (e.g. Cox regression, ref. 3) do not follow any particular statistical distribution: they are fitted from data
➔ Fixed link between output and features: effort goes into model selection and evaluation
2) Hothorn T. et al., 2006. Unbiased recursive partitioning: A conditional inference framework.
3) Cox D.R., 1972. Regression Models and Life-Tables.
● One model: Conditional Inference Survival Ensembles (ref. 2)
○ deals with censoring
○ high accuracy due to ensemble learning
Survival Analysis
Survival Tree
➔ Split the feature space recursively
➔ Based on a survival statistical criterion, the root node is divided into two daughter nodes
➔ Maximize the survival difference between nodes
➔ A single tree produces unstable predictions
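One way to picture "maximize the survival difference between nodes" is a split search over a single feature. The sketch below swaps in the classic two-sample log-rank statistic as the survival criterion; it is not the criterion of the conditional inference method described later, and the function names and toy data are assumptions made for illustration.

```python
def logrank_statistic(times, events, in_left):
    """Chi-square log-rank statistic comparing the left and right groups."""
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        n = len(at_risk)
        n_left = sum(1 for i in at_risk if in_left[i])
        d = sum(1 for i in at_risk if times[i] == t and events[i])
        d_left = sum(1 for i in at_risk
                     if times[i] == t and events[i] and in_left[i])
        # Observed churners on the left minus the number expected
        # if both groups shared the same hazard
        observed_minus_expected += d_left - d * n_left / n
        if n > 1:
            variance += d * (n_left / n) * (1 - n_left / n) * (n - d) / (n - 1)
    return observed_minus_expected ** 2 / variance if variance > 0 else 0.0


def best_split(feature, times, events):
    """Try every midpoint between distinct feature values; return (cut, statistic)."""
    values = sorted(set(feature))
    best = (None, 0.0)
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        stat = logrank_statistic(times, events, [x <= cut for x in feature])
        if stat > best[1]:
            best = (cut, stat)
    return best


# Toy data: players with little playtime churn early
playtime = [1, 2, 3, 4, 10, 11, 12, 13]   # hypothetical feature
days = [5, 2, 6, 3, 30, 20, 35, 25]       # days until churn
churned = [True] * 8                      # no censoring in this toy example
cut, stat = best_split(playtime, days, churned)
print(cut)  # 7.0
```

Applying the same search recursively to each daughter node would grow a full tree.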
Conditional Survival Ensembles
➔ Make use of hundreds of trees
➔ Outstanding predictions
➔ Conditional inference survival ensembles use a Kaplan-Meier function as splitting criterion
➔ Robust information about variable importance ✓
➔ Overfitting is not present ✓
➔ Unbiased approach ✓
Conditional inference survival ensembles
● Two-step algorithm:
○ 1) the optimal split variable is selected: association between covariates and response
○ 2) the optimal split point is determined by comparing two-sample linear statistics for all possible partitions of the split variable
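The two steps above can be sketched for numeric covariates as follows. This is a simplified illustration under stated assumptions, not the reference implementation: it uses log-rank scores as the response, ranks variables by a standardized linear statistic instead of comparing p-values, and omits the stopping rule. All function and covariate names are hypothetical.

```python
import math


def logrank_scores(times, events):
    """Event indicator minus the Nelson-Aalen cumulative hazard at each player's time."""
    scores = []
    for t_i, e_i in zip(times, events):
        hazard = 0.0
        for t in sorted({tj for tj, ej in zip(times, events) if ej and tj <= t_i}):
            d = sum(1 for tj, ej in zip(times, events) if tj == t and ej)
            n = sum(1 for tj in times if tj >= t)
            hazard += d / n
        scores.append(e_i - hazard)
    return scores


def standardized_stat(x, scores):
    """|T - E[T]| / sd[T] for T = sum(x_i * score_i) under permutation of the scores."""
    n = len(x)
    mean_x, mean_s = sum(x) / n, sum(scores) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    sss = sum((s - mean_s) ** 2 for s in scores)
    if sxx == 0 or sss == 0:
        return 0.0
    cov = sum((v - mean_x) * (s - mean_s) for v, s in zip(x, scores))
    return abs(cov) / math.sqrt(sxx * sss / (n - 1))


def choose_split(covariates, times, events):
    scores = logrank_scores(times, events)
    # Step 1: select the covariate most strongly associated with the response
    name = max(covariates, key=lambda k: standardized_stat(covariates[k], scores))
    x = covariates[name]
    # Step 2: compare two-sample statistics over all binary partitions of that covariate
    values = sorted(set(x))
    cuts = [(a + b) / 2 for a, b in zip(values, values[1:])]
    cut = max(cuts, key=lambda c: standardized_stat(
        [1.0 if v <= c else 0.0 for v in x], scores))
    return name, cut


covariates = {  # hypothetical covariates for eight players
    "playtime_per_day": [1, 2, 3, 4, 10, 11, 12, 13],
    "logins_per_day": [2, 1, 2, 1, 1, 2, 1, 2],
}
days = [5, 2, 6, 3, 30, 20, 35, 25]
churned = [True] * 8
print(choose_split(covariates, days, churned))  # ('playtime_per_day', 7.0)
```

Separating variable selection from split-point search is what keeps covariates with many candidate split points from being favored.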
Random Survival Forest (ref. 4)
➔ RSF is based on the original random forest algorithm (ref. 5)
➔ RSF favors variables with many possible split points over variables with fewer
4) Ishwaran H. et al., 2008. Random Survival Forests.
5) Breiman L., 2001. Random Forests.
Kaplan-Meier estimates (ref. 6)
● Cumulative survival probability
● Step function that changes every time a player churns
● Output in terms of level and playtime (hours played)
6) Kaplan E. L. et al., 1958. Non-parametric estimation from incomplete observations.
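The Kaplan-Meier estimate multiplies, at each observed churn time, the fraction of at-risk players who did not churn. A minimal sketch follows; the function name and the toy playtime values are assumptions made for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct churn time.

    times: playtime (or level) at churn or at censoring
    events: True if the player churned, False if censored
    """
    curve = []
    survival = 1.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = sum(1 for ti in times if ti >= t)
        churned = sum(1 for ti, e in zip(times, events) if ti == t and e)
        survival *= 1 - churned / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical hours played; the third and fifth players are censored
hours = [5, 8, 8, 12, 15, 20]
churned = [True, True, False, True, False, True]
for t, s in kaplan_meier(hours, churned):
    print(t, round(s, 3))
# 5 0.833
# 8 0.667
# 12 0.444
# 20 0.0
```

Censored players never trigger a step, but they stay in the at-risk count until their last observed time.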
Results: Conditional Inference Survival Ensembles
Feature selection
Daily logins, purchases, playtime and level-ups
player attention:
● information per day (e.g. playtime per day)
player loyalty:
● mean over several different time periods
● time elapsed since the first and last day with a given kind of information (e.g. time from last purchase)
player intensity:
● total amount (e.g. total in-app purchases)
player level (concept common to most games)
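Features of these kinds can be derived from per-day activity records. The sketch below is an illustration only: the record keys, the 7-day window and the feature names are assumptions, not the feature set used in the talk, and it assumes a non-empty log.

```python
from datetime import date


def player_features(daily_log, today):
    """Build a few example features from one player's daily records.

    daily_log: non-empty list of dicts with hypothetical keys
               'date', 'playtime_hours' and 'purchases'
    """
    days = sorted(daily_log, key=lambda r: r["date"])
    last_week = [r for r in days if (today - r["date"]).days < 7]
    purchase_days = [r["date"] for r in days if r["purchases"] > 0]
    return {
        # information per day
        "playtime_per_active_day": sum(r["playtime_hours"] for r in days) / len(days),
        # mean over a time period (here: the last 7 calendar days)
        "mean_playtime_last_7_days": sum(r["playtime_hours"] for r in last_week) / 7,
        # time elapsed since the last day with a purchase
        "days_since_last_purchase": (
            (today - purchase_days[-1]).days if purchase_days else None
        ),
        # total amount
        "total_purchases": sum(r["purchases"] for r in days),
    }


log = [  # hypothetical records
    {"date": date(2017, 8, 1), "playtime_hours": 2.0, "purchases": 1},
    {"date": date(2017, 8, 18), "playtime_hours": 1.0, "purchases": 0},
    {"date": date(2017, 8, 20), "playtime_hours": 3.0, "purchases": 2},
]
print(player_features(log, date(2017, 8, 21)))
```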
● RPG free-to-play game
● Action battle card game popular in Japan
● Long-term loyal players
Predicted Kaplan-Meier survival curves as a function of playtime (hours) and level for new or existing players
Censored data problem results
Validation -- Churn prediction
[Figure: median survival level, i.e. the level at which the probability of surviving in the game is 50%. Panels: Survival Ensembles, Cox Regression]
[Figure: median survival playtime, i.e. the number of hours played at which the probability of surviving in the game is 50%. Panels: Survival Ensembles, Cox Regression]
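The median survival value used in these validation plots can be read off a step survival curve as the first point where the survival probability drops to 50% or below. A minimal sketch, assuming the curve is given as a list of (time, survival probability) pairs sorted by time; the function name is hypothetical.

```python
def median_survival(curve):
    """Smallest time (level or hours) at which S(t) <= 0.5, or None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical predicted curve for one player, in hours played
curve = [(5, 0.83), (8, 0.67), (12, 0.44), (20, 0.0)]
print(median_survival(curve))                  # 12
print(median_survival([(5, 0.9), (8, 0.8)]))   # None
```

A result of None means the curve stays above 50% over the observed range, so the median is not reached.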
1000 bootstrap cross-validation error curves for the survival ensemble model and Cox regression
Model               IBS (ref. 7)
Survival Ensemble   0.025
Cox Regression      0.054
Kaplan Meier        0.127
7) Graf E. et al., 1999. Assessment and comparison of prognostic classification schemes for survival data.
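The IBS (integrated Brier score) averages, over a time grid, the squared gap between each player's observed status and the predicted survival probability; lower is better. The sketch below is a simplified version that assumes every churn time is observed. It leaves out the censoring weights of the estimator in ref. 7, and the function names and toy predictors are assumptions made for illustration.

```python
def brier_score(t, times, predicted_survival):
    """Mean squared gap at time t between observed status and predicted S_i(t).

    predicted_survival: one callable per player, mapping t to a probability
    """
    n = len(times)
    return sum(((1.0 if ti > t else 0.0) - s(t)) ** 2
               for ti, s in zip(times, predicted_survival)) / n


def integrated_brier_score(grid, times, predicted_survival):
    """Trapezoidal integral of the Brier score over grid, divided by the grid span."""
    scores = [brier_score(t, times, predicted_survival) for t in grid]
    area = sum((b - a) * (sa + sb) / 2
               for a, b, sa, sb in zip(grid, grid[1:], scores, scores[1:]))
    return area / (grid[-1] - grid[0])


times = [10, 20, 30]            # hypothetical churn times, all observed
grid = [0, 10, 20, 30, 40]
perfect = [lambda t, T=T: 1.0 if t < T else 0.0 for T in times]
coin_flip = [lambda t: 0.5 for _ in times]
print(integrated_brier_score(grid, times, perfect))    # 0.0
print(integrated_brier_score(grid, times, coin_flip))  # 0.25
```

A constant prediction of 0.5 scores 0.25, which gives a rough reference point for reading the values in the table.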
1000 bootstrap cross-validation error curves for the survival ensemble model and Cox regression
Model               IBS (ref. 7)
Survival Ensemble   0.026
Cox Regression      0.044
Kaplan Meier        0.134
Summary and conclusion
● Application of the state-of-the-art algorithm “conditional inference survival ensembles”
○ to predict churn and survival probability of players in social games
○ median survival time, i.e. the time at which the probability of surviving in the game is 50%, can be used as a time threshold to categorize a player as at risk of churning
● Model able to make predictions every day in an operational environment
● Adapts to other game data: Democratizing Game Data Science (YOKOZUNA data)
○ It does not require previous manipulation of the data
○ It deals efficiently with the temporal dimension
○ It can be parallelized
○ It outputs not only churn information but also variable importance
THANK YOU
yokozunadata.com