
Personalization and Optimization of Decision Parameters via Heterogenous Causal Effects

Ye Tu, LinkedIn Corporation, [email protected]

Kinjal Basu, LinkedIn Corporation, [email protected]

Jinyun Yan, LinkedIn Corporation, [email protected]

Birjodh Tiwana, LinkedIn Corporation, [email protected]

Shaunak Chatterjee, LinkedIn Corporation, [email protected]

ABSTRACT
Randomized experimentation (also known as A/B testing or bucket testing) is very commonly used in the internet industry to measure the effect of a new treatment. Often, the decision on the basis of such A/B testing is to ramp the treatment variant that did best for the entire population. However, the effect of any given treatment varies across experimental units, and choosing a single variant to ramp to the whole population can be quite suboptimal. In this work, we propose a method which automatically identifies the collection of cohorts exhibiting heterogeneous treatment effects (using causal trees). We then use stochastic optimization to identify the optimal treatment variant for each cohort.

We use two real-life examples — one related to serving notifications and the other related to modulating ads density on feed. In both examples, using offline simulation and online experimentation, we demonstrate the benefits of our approach. At the time of writing this paper, the method described has been deployed on the LinkedIn Ads and Notifications systems.

CCS CONCEPTS
• Information systems → Personalization; Online advertising; Social networks; • Mathematics of computing → Causal networks; • Theory of computation → Stochastic control and optimization;

KEYWORDS
Heterogeneous causal effects, Stochastic optimization, Personalization, Online decision parameter

ACM Reference Format:
Ye Tu, Kinjal Basu, Jinyun Yan, Birjodh Tiwana, and Shaunak Chatterjee. 2019. Personalization and Optimization of Decision Parameters via Heterogenous Causal Effects. In Proceedings of . ACM, New York, NY, USA, 10 pages. https://doi.org/

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
, 2019
© 2019 Copyright held by the owner/author(s).

1 INTRODUCTION
In large-scale social media platforms, the member experience is often controlled by certain parameters in conjunction with machine-learned models. Such parameters could be independent of the machine-learned models, for instance, the background color and font of the platform. Alternately, they could be key decision parameters that operate on the output of machine-learned models to determine the overall member experience.

An example is the decision to send or drop a notification depending on a threshold parameter, which is used directly on the output of a machine learning model. If the score of the notification obtained from a machine-learned model is greater than the threshold, we decide to send the notification; otherwise we drop it [10]. The machine learning model controls the relevance of the item, while the threshold controls certain business metrics. An argument for not having a global threshold is to prevent a member who has many high-scoring notifications from getting too many, and to let another member who has no high-scoring ones still get some in special scenarios (e.g., a new member).

A second example of such an influential parameter is the gap between successive ads on a newsfeed product. The parameter in question, which operates independently of the related machine learning models, specifies the number of organic updates that are shown between two successive ads. The machine learning models control the ranking of both the ads and organic updates.

In an ideal situation, we would often want such parameters to be fully personalized. However, in order to do so, we would need randomized data across the different choices of parameters for each individual. But that is an extremely challenging problem since many members may not visit frequently enough for us to collect such data, and even for the members who do, such expansive data collection could severely compromise the member experience for too many members. A potential solution is to fix a global parameter (e.g., ads gap or notification threshold) for all members and then iterate on the choice of this parameter through A/B testing. However, that may very well be a suboptimal solution. There can be a group of users for whom, had we chosen a different parameter, we could have achieved an improved overall objective. More formally, estimating the heterogeneous causal effects from a randomized experiment [14, 26] is critical in providing a more personalized experience to the member and maximizing business objectives.

In this paper, we describe a formal mechanism which allows us to use randomized experiment data, identify a set of heterogeneous cohorts of members, and apply an optimized parameter for

arXiv:1901.10550v2 [stat.ME] 4 Feb 2019


each such cohort to maximize one business metric while satisfying constraints on others. Throughout this paper, we assume that we have the ability to run randomized experiments on members with different values of the parameter in question, and capture the member-related actions (i.e., business metrics) such as clicks, comments, views, scrolls, etc. This assumption is very reasonable since such randomized experiments would be conducted regardless to identify the globally optimal value. Once we have the data, we use the recursive partitioning technique [5] to generate the heterogeneous cohorts along with an estimate of the causal effect for each cohort. Then we solve the optimization problem based on the cohort definition, using a modified version of the algorithm from [15], to generate the optimal parameter value for each cohort group.

We show the benefit of our proposed approach on two different applications at LinkedIn. The first problem is estimating the threshold for notifications sent from LinkedIn, and the second one is determining the minimum gap for ads in the LinkedIn feed. For an overview of these problems, please see [4, 10, 16]. Note that our procedure is general and can be applied to any similar problem. Through relevant offline simulations and online A/B testing experiments, we show that our solution performs significantly better than estimating one global parameter for both of these examples.

The rest of the paper is organized as follows. In Section 2, we describe the generic problem formulation while focusing on the LinkedIn Notifications and Ads systems. We then describe our overall methodology in Section 3. After describing the data generation mechanism in Section 4, we briefly describe the engineering architecture required to solve the problem in production in Section 5. Offline and online experimental results are discussed in Section 6, followed by some of the related works in Section 7. We finally conclude with a discussion in Section 8.

2 PROBLEM SETUP
It is well known that individual member behavior can be distinctly different even when the same treatment or intervention is applied [5]. Hence, member-level personalization is a crucial problem in any web-facing company. It has become a standard practice for web-facing companies to estimate causal effects in order to evaluate new features and guide product development [14, 25, 28].

There are two major challenges in using randomized experiments to fully personalize product development. First, it may not be possible to obtain large amounts of data on every single individual without compromising the member experience of the entire population by a huge amount. Such a compromise is generally deemed unacceptable, and for good reason. Secondly, the lack of sufficient data leads to a large variance in any member-level personalization model. Thus, instead of going towards member-level personalization, we frame the problem of personalizing (or customizing) at a member cohort level. This hits the sweet spot by achieving a more optimal solution than the single global parameter while containing the risk of data collection. Choosing a global parameter usually gives a lower Pareto optimal curve, while choosing ad-hoc cohorts can improve it a bit. What we are truly interested in is identifying the optimal cohorts and an optimal parameter for each such cohort so that we keep improving on the Pareto optimal front. See Figure 1 for details.

[Figure omitted: Pareto curves of Objective vs. Guardrail for a Global Parameter, Ad-hoc Cohorts, and Optimal Cohorts]

Figure 1: Pareto Optimal Curves for Objective vs Guardrail metrics

To describe the generic problem formulation, let us begin with some notation. Let $J$ denote the total number of different choices or values of the parameter that we wish to optimize. We assume that there are $K + 1$ different metrics that we wish to track, where we have one primary metric and $K$ guardrail metrics. Let $C_1, \ldots, C_n$ denote the set of distinct, non-intersecting cohorts which exhibit heterogeneous behavior with respect to the $K + 1$ metrics when treated by the parameter value in question.

On each cohort $C_i$ for $i = 1, \ldots, n$, let $U_{ij}^k$ denote the causal effect on metric $k$ of applying treatment value $j$ vs. control. We further assume that each such random variable $U_{ij}^k$ is distributed as a Gaussian random variable with mean $\mu_{ij}^k$ and standard deviation $\sigma_{ij}^k$. For notational simplicity, for each metric $k = 0, \ldots, K$, let us denote by $U_k$ the vectorized version of $U_{ij}^k$ for $i = 1, \ldots, n$, $j = 1, \ldots, J$. Using a similar notation for $\mu_k$ and $\sigma_k$, we get
$$ U_k \sim N_{nJ}\left(\mu_k, \mathrm{Diag}(\sigma_k^2)\right). $$
Finally, let us denote by $x_{ij}$ the probability of assigning the $j$-th treatment value to the $i$-th cohort, and by $x$ its vectorized version. For ease of readability, we tabulate these in Table 1.

Table 1: The meaning of each symbol in the problem formulation

Symbol | Meaning
$J$ | Total number of parameter values or choices.
$K$ | Total number of guardrail metrics.
$C_i$ | $i$-th cohort for $i = 1, \ldots, n$.
$U_k$ | Vectorized version of $U_{ij}^k$, the causal effect on metric $k$ of parameter $j$ in cohort $C_i$.
$\mu_k$ | Mean of $U_k$.
$\sigma_k^2 I$ | Variance of $U_k$.
$x$ | The assignment vector.
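As a concrete illustration of this setup (not part of the paper's system), the following numpy sketch draws one realization of each $U_k$ for hypothetical values of $n$, $J$, and $K$ with made-up means and variances:

```python
import numpy as np

n, J, K = 3, 2, 1          # hypothetical: 3 cohorts, 2 treatment values, 1 guardrail metric
rng = np.random.default_rng(0)

# made-up means and standard deviations for each metric k = 0, ..., K
mu = rng.normal(size=(K + 1, n * J))              # mu_k, vectorized over (i, j)
sigma = np.abs(rng.normal(size=(K + 1, n * J)))   # sigma_k, elementwise

# one realization of U_k ~ N_{nJ}(mu_k, Diag(sigma_k^2)) per metric
U = mu + sigma * rng.standard_normal((K + 1, n * J))
print(U.shape)  # (2, 6): K + 1 metrics, each a vector of length n * J
```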


[Figure omitted: (a) Example of Notifications at LinkedIn; (b) Example of minGap of Ads in Feed]

Figure 2: Notifications and Ads Example

Based on these notations, we can formulate our optimization problem. Let $k = 0$ denote the main metric. We wish to optimize the main metric while keeping the guardrail metrics at a threshold. Formally, we wish to get the optimal $x^*$ by solving the following:

$$\begin{aligned}
\underset{x}{\text{Maximize}} \quad & x^T U_0 \\
\text{subject to} \quad & x^T U_k \le c_k \quad \text{for } k = 1, \ldots, K, \\
& \sum_j x_{ij} = 1 \;\; \forall i, \quad 0 \le x \le 1
\end{aligned} \tag{1}$$

where $c_k$ are known thresholds. Based on these, the overall problem can be formalized into two steps.

(1) Identifying member cohorts $C_1, \ldots, C_n$ using data from randomized experiments.
(2) Identifying the optimal parameter $x^*$ by solving the optimization problem (1).

Before discussing the methodology of the solution in more detail (in Section 3), let us provide more context on the two motivating applications — adjusting the notifications threshold and personalizing the density of advertisements at LinkedIn.

2.1 Notifications delivery at LinkedIn
Notifications have become an increasingly critical channel for social networks to keep users informed and engaged, as a result of continuously growing mobile phone usage. LinkedIn, as a professional social network site, implemented a near real-time optimization system for activity-based notifications [10]. The system is composed of response prediction models to predict users' interests (click-through rate on positive and negative signals).

As we serve those time-sensitive activity-based notifications (example in Figure 2a) in a streaming fashion, we introduced a model threshold parameter to make a send-or-drop decision on each candidate notification. To avoid overwhelming users, the system introduced a per-user constraint that enforces that each user receives at most a few notifications per day. As a result, the choice of the model threshold parameter has an impact on both the quantity and the quality of notifications.

The system initially had a global model threshold, which was sending sub-optimal quality notifications to active users while failing to send notifications to some less active users. To improve the system, we introduced a personalized threshold strategy to reshape user coverage at a finer granularity. In prior work, we grouped users into four segments according to their activeness: daily, weekly, monthly, and dormant users. Then, we fine-tuned the parameters to personalize at the pre-defined group level. However, the choice of the cohorts was ad-hoc and not necessarily optimal for having the strongest heterogeneous effects across the metrics at hand.

Thus, the main problems that we solve in this paper are:

• Automatically identifying cohorts which show the strongest heterogeneous effects across the treatments (e.g., send threshold).
• Solving an optimization problem to assign the optimal treatment to each of these cohorts.

Once we know the exact cohorts, the optimization problem in this example can be written out exactly as (1), where the primary metric is sessions, while the guardrail metrics are contributions, send volume, click-through rate, notification disabled users, and total notification disables.

2.2 Ranking Ads in LinkedIn Feed
LinkedIn Feed contains various types of updates. At a high level, they can be split into two categories: sponsored (ads) and organic updates. Organic updates are the main driver for engagement, while sponsored updates are the source of revenue. Both engagement and revenue are important yet conflicting because slots on the newsfeed are shared. How to rank both organic and sponsored updates to balance engagement and revenue is a critical problem [1–3].

One solution is slotting sponsored updates at fixed positions for all members. A simplistic solution is to control the ads density using a parameter minGap, which controls the number of organic updates between two consecutive ads.¹ Figure 2b shows an example of how minGap controls the position of ads. This solution is suboptimal because a single global parameter cannot capture the variability in members' preference for ads density.

Instead, we formulate it as an optimization problem to select the best gap for each cohort of members. The objective is to maximize revenue with constraints on organic and ads engagement utility. Mathematically, the problem can be formulated as:

$$\begin{aligned}
\underset{x_{ij}}{\text{Maximize}} \quad & \sum_{i=1}^n \sum_{j=1}^J x_{ij} r_{ij} \\
\text{such that} \quad & \sum_{i=1}^n \sum_{j=1}^J x_{ij} c_{ij} \ge w_c\, c_{\text{control}} \\
& \sum_{i=1}^n \sum_{j=1}^J x_{ij} o_{ij} \ge w_o\, o_{\text{control}} \\
& \sum_j x_{ij} = 1 \;\; \forall i, \quad 0 \le x_{ij} \le 1
\end{aligned} \tag{2}$$

where $r_{ij}$ is the estimated causal effect on revenue (the primary metric), and $c_{ij}$ and $o_{ij}$ are the estimated causal effects on ads engagement and organic engagement respectively (the guardrail metrics) for cohort $C_i$ and treatment $j$. $w_c$ and $w_o$ denote the weights for the tradeoff when comparing with the control for both utilities. Note that this optimization problem is also an explicit example of the generic problem stated in (1).
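If the random effects in (2) are replaced by point estimates (their means), the problem reduces to a linear program. A minimal sketch using scipy's linprog, with made-up effect estimates and thresholds (all numbers hypothetical; the paper's actual solver is the stochastic method of Section 3.2, which accounts for the variance of the effects):

```python
import numpy as np
from scipy.optimize import linprog

n, J = 2, 3                      # hypothetical: 2 cohorts, 3 minGap values
r = np.array([[1.0, 0.8, 0.5],   # made-up revenue effects r_ij
              [0.9, 0.7, 0.4]])
c = np.array([[-0.2, 0.1, 0.3],  # made-up ads-engagement effects c_ij
              [-0.1, 0.2, 0.4]])
o = np.array([[-0.3, 0.0, 0.2],  # made-up organic-engagement effects o_ij
              [-0.2, 0.1, 0.3]])
c_floor, o_floor = 0.0, 0.0      # hypothetical w_c * c_control, w_o * o_control

# linprog minimizes, so maximize revenue by negating r;
# each ">= floor" constraint becomes "-(effect) <= -floor"
A_ub = np.vstack([-c.ravel(), -o.ravel()])
b_ub = np.array([-c_floor, -o_floor])
A_eq = np.kron(np.eye(n), np.ones(J))   # each cohort's probabilities sum to 1
res = linprog(-r.ravel(), A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=np.ones(n), bounds=(0, 1))
x = res.x.reshape(n, J)          # x[i, j]: probability of minGap value j in cohort i
```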

3 METHODOLOGY
So far, we have described a generic problem (1) and two explicit examples in the LinkedIn ecosystem. In this section, we detail our proposal to solve the problem. We first begin with how we can automatically identify the cohorts of members who show the most heterogeneous effect, instead of picking them in an ad-hoc manner. We then describe how we solve the optimization problem to assign parameter values to each cohort.

3.1 Heterogeneous Cohorts
Estimation of heterogeneous treatment effects is a well-studied topic in the social and medical sciences [5, 26]. In this paper, we follow the potential outcomes framework from Rubin (1974) [21] and consider the following assumptions:

• The Stable Unit Treatment Value Assumption (SUTVA) [21], which states that the response of a treated unit depends only on the treatment allocated to that unit and not on the treatments given to other units.
• Strongly Ignorable Treatment Assignment [20], which combines the assumptions of unconfoundedness and overlap. We refer to [20] for the details.

¹ The actual solution we use at LinkedIn considers the quality of organic and sponsored updates in a session to make the final decision, but minGap nonetheless plays an important role in determining the ads density.

Throughout this section, we assume that we have access to data from randomized experiments. That is, we have observed the values of each metric $k = 0, \ldots, K$ obtained by giving treatment value $j$ to a randomized group of experimental units, for $j = 1, \ldots, J$. For each treatment $j$ and each metric $k$, we use the recursive partitioning technique from [5] to identify the heterogeneous causal effects. The Causal Tree Algorithm from [5] generates a tree and hence a partition of the entire set of members into disjoint cohorts $\{C_1, \ldots, C_n\}_{jk}$. The subscript $jk$ signifies the cohort definition from treatment $j$ and metric $k$. For the sake of completeness, we briefly describe the causal tree algorithm and how we combine the $J(K+1)$ different trees to finally get one cohort definition.

3.1.1 Causal Tree Algorithm. The causal tree algorithm [5] is based on the conventional CART algorithm [7], which recursively partitions the observations of the training sample to build the initial tree and uses cross-validation to select a complexity parameter to prune the tree. There are major differences between estimating heterogeneous causal effects and the standard response prediction problem which CART tries to solve. Specifically,

• The goal is to estimate the delta effect between the treatment and control groups, $\tau_i = Y_{1i} - Y_{0i}$, instead of the outcome variable $y_i$.
• As member-level treatment effects are not observable (a member cannot be in both treatment and control groups at the same time), we need to estimate $\tau_i$ based on the conditional average treatment effect of units in the same leaf $\ell$ as $\hat{\tau}_i = \hat{\tau}_\ell = E(Y_{1\ell}) - E(Y_{0\ell})$, where $i \in \ell$.
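The leaf-level estimate is simply the difference of treatment and control outcome means within the leaf, computed on a held-out estimation sample. A minimal sketch, with hypothetical column names w (treatment indicator) and y (outcome):

```python
import pandas as pd

def leaf_cate(est_sample: pd.DataFrame) -> float:
    """Honest CATE for one leaf: mean(y | treated) - mean(y | control),
    computed on the held-out estimation sample, not the training sample."""
    treated = est_sample.loc[est_sample["w"] == 1, "y"]
    control = est_sample.loc[est_sample["w"] == 0, "y"]
    return treated.mean() - control.mean()

# toy leaf: treated outcomes average 1.0, control outcomes average 0.0
leaf = pd.DataFrame({"w": [1, 1, 0, 0], "y": [1.2, 0.8, 0.1, -0.1]})
print(leaf_cate(leaf))  # 1.0
```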

Due to these differences, the causal tree algorithm modifies the CART criteria to generate the heterogeneous cohorts. In order to ensure high confidence in the leaf-level estimations, the algorithm introduces a new regularization term in the splitting objective to minimize leaf-level variance. [5] introduces a procedure called honest estimation, which partitions the data into training, estimation, and testing folds ($S^{tr}$, $S^{est}$, $S^{cv}$, with sizes $N^{tr}$, $N^{est}$, $N^{cv}$), and proposes to use a separate dataset for tree training vs. estimation to keep the estimated effects honest. The tree splitting criterion for the causal tree algorithm is to maximize:

$$-\widehat{\mathrm{EMSE}}_\tau(S^{tr}, N^{est}, \Pi, \alpha) = \frac{\alpha}{N^{tr}} \sum_{i \in S^{tr}} \hat{\tau}^2(X_i; S^{tr}, \Pi) - (1 - \alpha)\left(\frac{1}{N^{tr}} + \frac{1}{N^{est}}\right) \sum_{\ell \in \Pi} \left(\frac{\sigma^2_{S^{tr}_{treat}}(\ell)}{p} + \frac{\sigma^2_{S^{tr}_{control}}(\ell)}{1 - p}\right) \tag{3}$$

where $\Pi$ is a tree and $\alpha$ is a hyperparameter we can tune to adjust the weight between the first part (MSE) and the second part (sum of within-leaf variances) of the objective; $p$ represents the fraction of records in treatment out of the total for each leaf $\ell$. After the training phase, we conduct cross-validation, similar to the standard prediction problem, to maximize the cross-validation objective.
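A small sketch of evaluating the splitting objective (3) for one candidate partition; the inputs (per-record effect estimates on the training fold and per-leaf variance summaries) are assumed to have been computed elsewhere, and all numbers in the example are made up:

```python
import numpy as np

def neg_emse(tau_tr, leaf_stats, n_tr, n_est, alpha):
    """Splitting objective (3), to be maximized over candidate splits.
    tau_tr: per-record effect estimates tau_hat(X_i) on S_tr;
    leaf_stats: iterable of (var_treat, var_control, p) per leaf."""
    fit = alpha / n_tr * np.sum(np.square(tau_tr))
    penalty = (1 - alpha) * (1 / n_tr + 1 / n_est) * sum(
        v_t / p + v_c / (1 - p) for v_t, v_c, p in leaf_stats)
    return fit - penalty

# toy example: two records, one leaf with equal treat/control variance
crit = neg_emse(np.array([1.0, -1.0]), [(0.5, 0.5, 0.5)],
                n_tr=2, n_est=2, alpha=0.5)
print(crit)  # -0.5
```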

3.1.2 Combining Multiple Trees. Note that for each treatment level $j$ and each metric $k$ we can identify heterogeneous cohorts using the Causal Tree Algorithm. One of the major challenges is to


Algorithm 1 Merging Trees
1: Input: A set of $L$ collections of cohorts: $\{\{C_i^\ell\}_{i=1}^{n_\ell} \mid \ell = 1, \ldots, L\}$.
2: $C_{out} = \{C_i^1\}_{i=1}^{n_1}$
3: for $\ell = 2, \ldots, L$ do
4:   $C_{base} = C_{out}$
5:   for $S_1 \in C_{base}$ do
6:     for $S_2 \in \{C_i^\ell\}_{i=1}^{n_\ell}$ do
7:       $S_{int} = S_1 \cap S_2$
8:       if isStatSig($\{S_{int}, S_1 \setminus S_{int}\}$) then
9:         Update $C_{out}$ by splitting $S_1$ into $S_{int}$ and $S_1 \setminus S_{int}$
10:        Set $S_1 = S_1 \setminus S_{int}$
11:      end if
12:    end for
13:  end for
14: end for
15: return $C_{out}$

combine these $J(K+1)$ different trees into a single tree so that we can have a single collection of cohorts.

One option is to estimate member-level metric impacts using all the trees and then assign the decision parameter at the member level. However, that would raise scalability concerns in the optimization problem. A better alternative is to merge the tree models into one single cohort assignment. Various mathematical and data mining approaches can be used for this task [23].

We explored an idea that sequentially merges pairs of trees², illustrated as Algorithm 1. We start by picking a base cohort definition and consider every other cohort definition through a looping mechanism. We pick any cohort $S_1$ in our base partition and, depending on the sub-partition induced by the other cohort definition, we consider splitting $S_1$ if all splits are statistically significant for some metric $k$. For further details please see Section 9. We leave it as future work to identify more formal ways of defining a single collection of cohorts across multiple treatments and multiple metrics.
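A simplified sketch of the merging procedure, representing each cohort as a set of member ids and stubbing out the isStatSig test (names and structure are illustrative, not the production implementation):

```python
def merge_cohorts(collections, is_stat_sig):
    """Sequentially merge L cohort collections (each a list of member-id
    sets) into one partition, splitting a base cohort only where
    is_stat_sig approves the split (Algorithm 1)."""
    out = [set(c) for c in collections[0]]     # base cohort definition
    for coll in collections[1:]:
        new_out = []
        for s1 in out:
            for s2 in coll:
                inter = s1 & s2
                # split s1 only on a proper, significant intersection
                if inter and inter != s1 and is_stat_sig(inter, s1 - inter):
                    new_out.append(inter)
                    s1 = s1 - inter
            new_out.append(s1)
        out = [s for s in new_out if s]
    return out

# toy: two binary splits on different attributes -> four merged cohorts
a = [{1, 2, 3, 4}, {5, 6, 7, 8}]
b = [{1, 2, 5, 6}, {3, 4, 7, 8}]
print(merge_cohorts([a, b], lambda x, y: True))
# [{1, 2}, {3, 4}, {5, 6}, {7, 8}]
```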

3.2 Stochastic Optimization
Note that problem (1) is stochastic, since both the objective function and the constraints used in the optimization formulation are not deterministic but come from a particular distribution (Gaussian in the above case). More formally, we can rewrite the optimization problem in (1) as:

$$\begin{aligned}
\underset{x}{\text{Maximize}} \quad & f(x) = E(x^T U_0) \\
\text{subject to} \quad & g_k(x) := E(x^T U_k - c_k) \le 0, \quad k = 1, \ldots, K, \\
& \sum_j x_{ij} = 1 \;\; \forall i, \quad 0 \le x \le 1
\end{aligned} \tag{4}$$

Note that in the above optimization formulation, the objective function $f$ and the constraints $g_k$ are not directly observed, but are observed via realizations of the random variables $U_0, \ldots, U_K$.

One method of solving such a problem in the literature is known as sample average approximation (SAA) [13]. In SAA, instead of solving the stochastic problem, one replaces the stochastic objective and the stochastic constraints with their empirical sample expectations. In many cases, such as ours, the SAA solution might lead to an infeasible problem due to the large variance in the random variables; capturing the mean may not be enough. Moreover, SAA is computationally challenging for a general function $G_k$. Another method for solving a similar problem is known as stochastic approximation (SA) [19], which has a projection step onto $\{g_k(x) \le 0\}$ that may not be possible in this situation.

² Further details for ease of reproducibility are deferred to Section 9.

Algorithm 2 Multiple Cooperative Stochastic Approximation
1: Input: Initial $x_1 \in X$, tolerances $\{\eta_{k,t}\}$, step sizes $\{\gamma_t\}$, total iterations $N$
2: for $t = 1, \ldots, N$ do
3:   Estimate $\hat{G}_{k,t}$ for all $k \in \{1, \ldots, K\}$ using (6).
4:   if $\hat{G}_{k,t} \le \eta_{k,t}$ for all $k$ then
5:     Set $h_t = F'(x_t, U_{0,t})$
6:   else
7:     Randomly select $k^*$ from $\{k : \hat{G}_{k,t} > \eta_{k,t}\}$
8:     Set $h_t = G'_{k^*}(x_t, U_{k^*,t})$
9:   end if
10:  Compute $x_{t+1} = P_{x_t}(\gamma_t h_t)$
11: end for
12: Define $B = \{1 \le t \le N : \hat{G}_{k,t} \le \eta_{k,t} \;\forall k \in \{1, \ldots, K\}\}$
13: return $\bar{x} := \sum_{t \in B} x_t \gamma_t \,/\, \sum_{t \in B} \gamma_t$

To have a more formal solution, we generalize the Cooperative Stochastic Approximation (CSA) algorithm from [15]. The CSA algorithm in [15] solves the problem with a single constraint, which we modify to work with multiple expectation constraints. Specifically, we start by framing the above problem as

$$\begin{aligned}
\underset{x}{\text{Minimize}} \quad & f(x) := E_{U_0}(F(x, U_0)) \\
\text{subject to} \quad & g_k(x) := E_{U_k}\left(G_k(x, U_k)\right) \le 0 \quad \text{for } k = 1, \ldots, K, \\
& x \in X
\end{aligned} \tag{5}$$

where $X$ denotes the set of non-stochastic constraints.

Our algorithm, which we call Multiple Cooperative Stochastic Approximation (MCSA), is an iterative algorithm which runs for $N$ steps. At each step $t$ we start by estimating the constraint functions. Specifically, we simulate $U_{k\ell}$ for $\ell = 1, \ldots, L$ and estimate
$$\hat{G}_{k,t} = \frac{1}{L} \sum_{\ell=1}^L G_k(x_t, U_{k\ell}). \tag{6}$$
If all the estimated constraints $\hat{G}_{k,t}$ are less than a threshold $\eta_{k,t}$, we choose our gradient to be the stochastic gradient of the objective function, $F'(x_t, U_{0,t})$. Otherwise, from the set of violated constraints $\{k : \hat{G}_{k,t} > \eta_{k,t}\}$, we choose a constraint $k^*$ at random and take the gradient to be the stochastic gradient of that constraint, $G'_{k^*}(x_t, U_{k^*,t})$. We then move along the chosen gradient with step size $\gamma_t$ and apply the traditional proximal projection operator to get the next point $x_{t+1}$. Our final optimal point is the weighted average of those $x_t$ where we have actually travelled along the gradient of the objective. The overall algorithm to solve (5) is written out as Algorithm 2.
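The following is a simplified sketch of this procedure for the linear case of (1), where $F(x, U_0) = x^T U_0$ and $G_k(x, U_k) = x^T U_k - c_k$. It uses a Euclidean projection of each cohort's row onto the probability simplex in place of the proximal step, and constant tolerances and step size; all parameter values are hypothetical, so this is an illustration, not the production implementation:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1
    rho = np.nonzero(u > css / np.arange(1, len(v) + 1))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0)

def mcsa(mu, sigma, c, x0, n_iter, eta, gamma, n_sim=50, seed=0):
    """MCSA sketch: maximize E[x^T U_0] s.t. E[x^T U_k] <= c_k.
    mu, sigma: (K+1, n, J) effect means/sds; x0: (n, J) initial assignment."""
    rng = np.random.default_rng(seed)
    x, x_avg, w_sum = x0.astype(float), np.zeros_like(x0, dtype=float), 0.0
    for _ in range(n_iter):
        # simulate U_{k,l} and estimate the constraints as in (6)
        U = mu + sigma * rng.standard_normal((n_sim,) + mu.shape)
        G = np.einsum("lkij,ij->k", U[:, 1:], x) / n_sim - c
        if np.all(G <= eta):
            h = U[:, 0].mean(axis=0)        # ascend objective gradient
            x_avg, w_sum = x_avg + gamma * x, w_sum + gamma
        else:
            k = rng.choice(np.nonzero(G > eta)[0])
            h = -U[:, k + 1].mean(axis=0)   # descend a violated constraint
        x = np.apply_along_axis(project_simplex, 1, x + gamma * h)
    return x_avg / w_sum if w_sum > 0 else x

# toy: 1 cohort, 2 treatments; treatment 0 pays 1 but costs 2, budget 1
mu = np.array([[[1.0, 0.0]], [[2.0, 0.0]]])
xbar = mcsa(mu, np.full(mu.shape, 0.01), c=np.array([1.0]),
            x0=np.array([[0.5, 0.5]]), n_iter=2000, eta=0.05, gamma=0.05)
```

In the toy problem the optimum splits probability evenly (treatment 0 at probability 0.5 saturates the constraint), and the averaged iterate hovers near that point.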


It has been theoretically proven in [15] that if there is a single stochastic constraint, then the number of iterations $N$ required to come $\epsilon$-close to the optimum, both with respect to the objective and the constraint, is of the order $O(1/\epsilon^2)$. Following a very similar proof strategy, it is not hard to see that we can carefully choose the tolerances $\{\eta_{k,t}\}$ and step sizes $\{\gamma_t\}$ such that if we follow Algorithm 2 for $N$ steps, then
$$E(f(\bar{x}) - f(x^*)) \le \frac{c_0}{\sqrt{N}} \quad \text{and} \quad E(g_k(\bar{x})) \le \frac{c_k}{\sqrt{N}} \quad \forall k \in \{1, \ldots, K\}.$$

We do not go into the details of the proof in this paper for brevity. Please see [15] for further details.

4 DATA COLLECTION
We now illustrate the process of data collection and processing to personalize and optimize parameters in both the notifications and ads use cases. We collect experimental data to test a collection of parameters where the treatment assignments are random and the regular assumptions of causal inference [20] hold. To measure the effect of the parameters, we carefully select a list of primary and guardrail metrics for each use case. In addition to the metric data, we also keep track of the allocation of treatment and control groups and the appropriate features.

4.1 Personalize Notification Thresholds

A set of prior experiments exists which tested a few model thresholds for daily active users (DAUs) while holding the thresholds for the rest of the users the same. We leverage the data from those past tests to generate the cohort-parameter mapping for our validation tests. We collected two weeks of experimental data, where the metrics of interest are described in Table 2. We include the following major categories of features in the heterogeneous cohort learning process:

• User profile features, such as country, preferred language, job title, etc.

• Notification-specific features, such as past notifications and emails received.

• Other activity features, including past visits, feed consumption, likes, shares, comments, etc.

• Features from users' networks (such as the number of connections/followers).

Table 2: Metrics of Interest for Notifications Use Case

Metrics | Descriptions
Sessions (Objective) | Total sessions (a measure of visits).
Daily Contributors | Daily unique number of users who generate or distribute content (e.g., share, like, comment, message).
Notification Sends | Total volume of notifications sent to users.
Notification CTR | Click-through rate on in-app notifications.
Notification Disable Users | Number of users who disable notifications.
Notification Disables | Total number of disables on notifications.

4.2 Personalize Ads Density

We conduct experiments to rank ads in the LinkedIn feed based on different minGap parameters. We collect three months of experimental data. The metrics of interest, composed of the primary metric (revenue) and a set of guardrail metrics, are illustrated in Table 3. In addition to the user profile, activity, and network features that were included in the notifications study, we add a set of ads-specific features (such as user-ads affinity features like past click-through rates (CTRs) and interactions on ads) in identifying the heterogeneous cohorts.

Table 3: Metrics of Interest for Ads Use Case

Metrics | Descriptions
Revenue (Objective) | Total estimated revenue.
Ads CTR | Click-through rate on all advertisements.
Ads Views | Views on all advertisements.
Feed Viral Actors | Unique number of users who take viral actions (like, share, comment) on Feed.
Feed Interactions | Total interactions on Feed.

5 SYSTEM ARCHITECTURE

We outline the design of the engineering architecture for the parameter personalization and optimization system in Figure 3. It consists of two major components: one for heterogeneous cohort learning and the other for stochastic optimization of the parameters. Both systems leverage tracking data (metrics and features) and randomized experiment data (the allocation of users in treatment or control groups).

Figure 3: System Architecture. (Diagram: the Heterogeneous Cohort Learning System processes tracking and randomized experiment data through causal tree trainings, combines multiple trees, and outputs the finalized heterogeneous cohorts; the Stochastic Optimization System then estimates cohort-level effects, runs stochastic optimization, and scores users to generate the cohort-parameter maps.)

The heterogeneous cohort learning system is similar to a standard response prediction training system in that it is composed of the following pipelines:

• A flow for processing tracking and randomized test data to generate training and cross-validation sets,

• A training pipeline to explore different hyper-parameters (e.g., α in (3), or standard tree model training parameters)


and generate the cohorts for each treatment level j and each metric of interest k.

• A flow to combine multiple trees to come up with a final cohort definition via Algorithm 1.

The output of the first system (a finalized cohort definition), together with the tracking and experiment allocation data, serves as the input for the stochastic optimization system. In this stage, we estimate the cohort-level effects for each treatment level j and metric of interest k and feed the estimates into the stochastic optimization pipeline to generate the final cohort-parameter maps.
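For a randomized experiment, the cohort-level effect estimate reduces to a difference in means between each treatment cell and its control within the cohort. A minimal sketch (the row layout is an assumption, not the paper's schema):

```python
from collections import defaultdict
from statistics import mean

def cohort_effects(rows, control="control"):
    """Per-cohort, per-variant difference-in-means effect estimates from
    randomized rows of the form (cohort_id, variant, metric_value)."""
    by_cell = defaultdict(list)
    for cohort, variant, y in rows:
        by_cell[(cohort, variant)].append(y)
    effects = {}
    for (cohort, variant), ys in by_cell.items():
        if variant == control:
            continue
        baseline = by_cell.get((cohort, control))
        if baseline:  # effect is defined only where the cohort has a control cell
            effects[(cohort, variant)] = mean(ys) - mean(baseline)
    return effects
```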

We developed the whole system offline using Apache Spark for data processing and leveraged packages in R for causal tree model training and stochastic optimization. The system produces the cohort definition and the mapping between cohorts and parameters, which can be encoded as a model file. Applying the personalized parameters in an online system only requires lightweight tree model scoring and post-processing flows to assign the parameters. The solution applies to a wide range of online systems without introducing a significant level of additional latency.
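Online, this amounts to a single tree traversal plus a dictionary lookup per request. A sketch of such scoring, using a hypothetical tuple encoding of the merged cohort tree (internal nodes are (feature, threshold, left, right); leaves are cohort ids):

```python
def assign_parameter(member_features, tree, cohort_params, default):
    """Walk the cohort tree to find the member's cohort, then look up
    that cohort's optimized parameter from the cohort-parameter map."""
    node = tree
    while isinstance(node, tuple):  # internal node: (feature, threshold, L, R)
        feature, threshold, left, right = node
        node = left if member_features.get(feature, 0.0) <= threshold else right
    return cohort_params.get(node, default)  # leaf holds the cohort id
```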

6 EXPERIMENTS

6.1 Offline Analysis

In this section, we present the offline analysis results for both the Notification and Ads use cases and share our learnings and some practical tips. All results shown in this section are statistically significant with a p-value < 0.05, obtained from repeated experiments.

6.1.1 Personalize Notification Threshold. As we briefly mentioned in Section 2, the baseline model for notifications is personalized at the level of pre-defined cohorts. For the initial validation tests, we directly leveraged the past experiment data that tuned the threshold for daily active users (DAUs). Therefore, our experiments first focus on personalizing and optimizing thresholds for DAUs.

Figure 4a-1 visualizes the cohorts generated using an optimized value of α. We have seen from our experiments that a lower value of α leads to a smaller number of cohorts and lower variance (hence higher confidence) in the cohort-level effect estimations, while a higher value of α results in a larger set of cohorts and lower confidence in the estimations. With a finalized cohort set and the cohort-level effect estimations, we ran stochastic optimization analyses using both the SGD and Adagrad options.³

We observe that the SGD method converges faster than Adagrad in the Notification use case, as seen in Figure 4a-2. Offline analysis suggested very promising results, as seen in Table 4. We can potentially lift the sessions metric without any tradeoff on the other guardrail metrics (e.g., daily contributors, notification CTR, notification disable users).

6.1.2 Personalize Ads Density. Figure 4b-1 shows the visualization of the heterogeneous cohorts selected in terms of two metrics: the objective (revenue) and one of the major constraints (Ads CTR). The x-axis shows the average treatment effect per cohort on the revenue metric, and the y-axis represents the impact on Ads CTR. The horizontal and vertical error bars accordingly represent the

³ SGD and Adagrad are different choices of the proximal projection operator in Algorithm 2. For further details, please see Section 9.

Table 4: Offline Estimation on LinkedIn Notifications

Metrics (Good Direction) | Adagrad | SGD
Sessions (Up) | +0.48% | +0.49%
Daily Contributors (Up) | +0.12% | +0.12%
Notification Send Volume (Up) | +0.59% | +0.63%
Notification CTR (Up) | +0.02% | +0.01%
Notification Disable Users (Down) | −0.02% | −0.02%
Notification Total Disables (Down) | −0.1% | −0.1%

standard deviation of the effects on revenue and Ads CTR. It is clear from the figure that the cohorts are nicely separated, especially in terms of heterogeneity in the revenue effects.

Figure 4b-2 shows the expected effect on the revenue metric as the iterations increase, for both SGD and Adagrad. We attempt to increase revenue without lifting the total ads views, even though ads revenue directly depends on the views. Due to the conflicting nature of the objective and constraint, it is difficult to satisfy the constraint while improving the objective. Unlike the case of Notifications, we observe that the Adagrad method tends to converge much faster than SGD in this scenario.

We also conducted a few more tests using simulated data and observed that, usually, SGD performs better than Adagrad in terms of convergence if the constraints are easy to satisfy. On the other hand, for hard-to-satisfy constraints, Adagrad generally tends to outperform SGD.⁴

Table 5 demonstrates our offline estimation of the total effects for the Ads use case. We present a comparison of applying pre-defined cohorts (based on member activeness level) vs. causal-tree-based cohorts, where both cases utilize stochastic optimization to select parameters. Offline estimations show that the causal-tree-based solution outperforms the pre-defined cohorts solution. We estimate a much higher revenue lift (+2.66% compared with +0.45%) with a similar trade-off on the guardrail metrics (e.g., "Feed Viral Actors" and "Feed Interactions").

Table 5: Offline Estimations for LinkedIn Ads with Ads View Constraint

Metrics (Direction) | Cohorts | Adagrad | SGD
Revenue (Up) | Pre-defined | +0.48% | +0.42%
Revenue (Up) | Causal Tree | +2.66% | +1.32%
Ads CTR (Up) | Pre-defined | +0.44% | +0.45%
Ads CTR (Up) | Causal Tree | +0.38% | +0.33%
Ads Views (Down) | Pre-defined | −0.08% | −0.01%
Ads Views (Down) | Causal Tree | −0.01% | 0%
Feed Viral Actors (Up) | Pre-defined | +0.02% | +0.02%
Feed Viral Actors (Up) | Causal Tree | +0.05% | 0%
Feed Interactions (Up) | Pre-defined | +0.01% | −0.01%
Feed Interactions (Up) | Causal Tree | −0.02% | −0.02%

⁴ See Section 9 for further details.


Figure 4: Heterogeneous cohorts in Notifications and Ads, where each color in (a-1) and (b-1) corresponds to one cohort. Figures (a-2) and (b-2) show the expected lift in sessions and revenue in the two corresponding problems as a function of the iterations of Algorithm 2. (Panels: (a) Offline Results in LinkedIn Notifications: (a-1) Heterogeneous Cohorts (Sessions vs. Notification CTR), (a-2) Expected Sessions Lift; (b) Offline Results in LinkedIn Ads: (b-1) Heterogeneous Cohorts (Revenue vs. Ads CTR), (b-2) Expected Revenue Lift.)

6.2 Online Experiments

Online A/B testing has become the gold standard for evaluating product developments at web-facing companies. We launched online tests to further validate our method. Due to the high cost of implementing the pipeline and launching online A/B tests, we designed our initial online validation tests differently for the Notification and Ads use cases to cover a few different aspects. All reported metric lifts in this section have a p-value of less than 0.05; otherwise we report the lift as neutral.

6.2.1 Notification Tests. As the notification use case already had fine-tuned cohort-level thresholds as the baseline, we directly launched our full solution trained on past notification experiment data. We observed a significant positive impact on the sessions metric, especially for DAUs, while the impacts on the constraint metrics mostly remain neutral (we also observe a lift in the daily contributors metric for DAUs), as shown in Table 6. The results are very close to our offline estimations. The validation test demonstrates the huge potential of our mechanism in an online setting. We are confident the impact of the method will be much larger when we expand the work to personalize model thresholds for all LinkedIn users (the current work focused on DAUs).

Table 6: Notification Experiment Results

Metrics | All Users | DAUs
Sessions | +0.37% | +0.56%
Daily Contributors | Neutral | +0.42%
Notification Send Volume | Neutral | Neutral
Notification CTR | Neutral | Neutral
Notification Disable Users | Neutral | Neutral
Notification Total Disables | Neutral | Neutral

6.2.2 Ads Tests. We conducted an online experiment for the solution with activity-based cohorts. Table 7 shows the result of a one-week experiment for the solution compared to the slotting mechanism. All numbers in the table have a p-value less than 0.01. In the problem formulation, we added a constraint on ads impressions to compare the efficiency of the two mechanisms: whether we can deliver better revenue and engagement given the same amount of ads impressions. In the offline simulation, the solution on activity-based cohorts has slightly negative (−0.01%) ads impressions. However, in the online experiment results, we observed a significant drop (−1.09%) in ads impressions. This is caused by strong seasonality in the ads ecosystem: a solution learned from past data may not be able to capture new trends in the revenue utility. An improvement would be to re-solve the problem with new data more frequently to accommodate such dynamics. Unfortunately, during our experimentation, our ads system released a new feature that changed our problem formulation, which is out of the scope of this paper. We leave it as future work to use the optimal cohorts instead of the ad-hoc activity-based cohorts.

Table 7: Ads Experiment Results

Metrics | All Users | DAUs | WAUs | MAUs
Revenue | Neutral | −1.52% | +3.47% | +3.31%
Ads CTR | +0.92% | +1.76% | Neutral | Neutral
Ads Impressions | −1.09% | −3.44% | +3.32% | +5.84%
Feed Viral Actions | Neutral | Neutral | Neutral | Neutral
Feed Interactions | Neutral | Neutral | Neutral | Neutral

Note that, by modulating the ads impressions differently in different activity-based segments, we were able to reduce the number of overall ads impressions without seeing any negative impact on revenue. Engagement metrics were also not hampered in this experiment, which gave us a win-win situation.

7 RELATED WORK

Estimating heterogeneity in causal effects has long been an important research area, well studied by researchers across domains. In recent years, a growing literature seeks to apply supervised machine learning methods to the problem [9, 12, 24]. Many researchers apply tree models and corresponding ensemble methods, since these can automatically identify nonlinear relationships and interactions [5, 11, 26]. Among these works, the causal tree and forest


algorithms from [5] and [26] stand out as a popular choice; they introduce the novel honest estimation concept and show substantial improvements in the coverage of confidence intervals.

In this paper, we have also described an algorithm for stochastic optimization with multiple expectation constraints based on the CSA algorithm developed in [15]. Traditionally, stochastic optimization routines were solved either via sample average approximation (SAA) [13, 27] or via stochastic approximation (SA) [19]. For details see [6] and [22]. There have been several papers showing improvements over the original SA method, especially for strongly convex problems [18], and for a general class of non-smooth convex stochastic programming problems [17]. However, note that these methods may not be directly applicable to the case where each gk is an expectation constraint. Very recently, [29] developed a method for stochastic online optimization with general stochastic constraints by generalizing Zinkevich's online convex optimization. Their approach is much more general and, as a result, may not be the optimal approach to solving this specific problem. For more related work in the stochastic optimization literature, we refer to [15] and [29] and the references therein.

8 DISCUSSION

In this paper, we introduce a novel mechanism to personalize and optimize parameters using randomized experiment data. The method consists of two major stages: identifying heterogeneous cohorts, and solving a stochastic optimization problem at the cohort level to identify the optimal parameter for each cohort. We applied the method in two LinkedIn use cases: personalizing the notification model threshold, and ranking ads on feed. Both offline analysis and online experiments show results that validate our approach. As the method leverages randomized experiment data, the estimated effect is causal, leading to the minor discrepancy between offline and online results. The solutions produced by the approach can be applied in any online system to personalize various decision parameters (including UI parameters) with trivial cost in terms of memory and latency.

A few non-trivial, but likely impactful, extensions are: (1) The cost of collecting online A/B test data can be large in terms of time and potential negative member experience for some use cases; designing a more cost-efficient data collection framework or leveraging observational data to achieve the same performance would be beneficial. (2) Users can potentially move in and out of cohorts; extending this framework to incorporate the dynamic nature of cohorts could be an interesting research topic. (3) As mentioned in Section 3.1.2, future work on generating one single cohort definition based on effects from multiple treatments with various metrics of interest could further improve the method.

REFERENCES
[1] Deepak Agarwal, Kinjal Basu, Souvik Ghosh, Ying Xuan, Yang Yang, and Liang Zhang. 2018. Online Parameter Selection for Web-based Ranking Problems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 23–32.
[2] Deepak Agarwal, Bee-Chung Chen, Rupesh Gupta, Joshua Hartman, Qi He, Anand Iyer, Sumanth Kolar, Yiming Ma, Pannagadatta Shivaswamy, Ajit Singh, and Liang Zhang. 2014. Activity Ranking in LinkedIn Feed. In KDD. ACM, New York, NY, USA, 1603–1612.
[3] Deepak Agarwal, Bee-Chung Chen, Qi He, Zhenhao Hua, Guy Lebanon, Yiming Ma, Pannagadatta Shivaswamy, Hsiao-Ping Tseng, Jaewon Yang, and Liang Zhang. 2015. Personalizing LinkedIn Feed. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). ACM, New York, NY, USA, 1651–1660.
[4] Deepak Agarwal, Souvik Ghosh, Kai Wei, and Siyu You. 2014. Budget Pacing for Targeted Online Advertisements at LinkedIn. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 1613–1619.
[5] Susan Athey and Guido Imbens. 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113, 27 (2016), 7353–7360.
[6] Albert Benveniste, Michel Métivier, and Pierre Priouret. 2012. Adaptive Algorithms and Stochastic Approximations. Vol. 22. Springer Science & Business Media.
[7] Leo Breiman. 2017. Classification and Regression Trees. Routledge.
[8] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (2011), 2121–2159.
[9] Jared C. Foster, Jeremy M. G. Taylor, and Stephen J. Ruberg. 2011. Subgroup identification from randomized clinical trial data. Statistics in Medicine 30, 24 (2011), 2867–2880.
[10] Yan Gao, Viral Gupta, Jinyun Yan, Changji Shi, Zhongen Tao, PJ Xiao, Curtis Wang, Shipeng Yu, Romer Rosales, Ajith Muralidharan, and Shaunak Chatterjee. 2018. Near Real-time Optimization of Activity-based Notifications. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 283–292.
[11] Donald P. Green and Holger L. Kern. 2012. Modeling Heterogeneous Treatment Effects in Survey Experiments with Bayesian Additive Regression Trees. Public Opinion Quarterly 76, 3 (2012), 491–511.
[12] Kosuke Imai and Marc Ratkovic. 2013. Estimating treatment effect heterogeneity in randomized program evaluation. Annals of Applied Statistics 7, 1 (2013), 443–470.
[13] Anton J. Kleywegt, Alexander Shapiro, and Tito Homem-de-Mello. 2002. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization 12, 2 (2002), 479–502.
[14] Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online Controlled Experiments at Large Scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '13). ACM, New York, NY, USA, 1168–1176.
[15] Guanghui Lan and Zhiqiang Zhou. 2016. Algorithms for stochastic optimization with expectation constraints. arXiv preprint arXiv:1604.03887 (2016).
[16] Haishan Liu, David Pardoe, Kun Liu, Manoj Thakur, Frank Cao, and Chongzhe Li. 2016. Audience Expansion for Online Social Network Advertising. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 165–174.
[17] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19, 4 (2009), 1574–1609.
[18] Boris T. Polyak and Anatoli B. Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30, 4 (1992), 838–855.
[19] Herbert Robbins and Sutton Monro. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22, 3 (1951), 400–407.
[20] Paul R. Rosenbaum and Donald B. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1 (1983), 41–55.
[21] Donald B. Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 5 (1974), 688.
[22] James C. Spall. 2005. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Vol. 65. John Wiley & Sons.
[23] Pedro Strecht. 2015. A Survey of Merging Decision Trees Data Mining Approaches. In Proceedings of the 10th Doctoral Symposium in Informatics Engineering. 36–47.
[24] Matt Taddy, Matt Gardner, Liyun Chen, and David Draper. 2016. A nonparametric Bayesian analysis of heterogeneous treatment effects in digital experimentation. Journal of Business & Economic Statistics 34, 4 (2016), 661–672.
[25] Diane Tang, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. 2010. Overlapping Experiment Infrastructure: More, Better, Faster Experimentation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10). ACM, New York, NY, USA, 17–26.
[26] Stefan Wager and Susan Athey. 2018. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Journal of the American Statistical Association 113, 523 (2018), 1228–1242.
[27] Wei Wang and Shabbir Ahmed. 2008. Sample average approximation of expected value constrained stochastic programs. Operations Research Letters 36, 5 (2008), 515–519.
[28] Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). ACM, New York, NY, USA, 2227–2236.
[29] Hao Yu, Michael J. Neely, and Xiaohan Wei. 2017. Online Convex Optimization with Stochastic Constraints. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17). Curran Associates Inc., USA, 1427–1437.


9 REPRODUCIBILITY

In this section, we describe some further details that should help the reader reproduce the methodology described in this paper. We also provide the R code that has been used to run all experiments. Due to the sensitive nature of the data, we are not able to disclose the actual data used in our experiments, but practitioners can use the details in this section and the code provided to run this system on their own datasets.

9.1 Details on Algorithm 1

The choice of the first cohort, which is set in line 2, is important since it determines the final set of cohorts. In our experiments, we have set it to be the cohort definition corresponding to the primary metric k = 0 and the treatment j which showed the highest lift in treatment effect across the different cohorts.

For completeness, we now describe the details of the isStatSig(·) function which has been used in Algorithm 1. The function takes in a collection of sets and identifies whether each set is statistically significant or not when compared across all K + 1 metrics. If there exists a metric k ∈ {0, . . . , K} for which all the sets show a statistically significant treatment effect, then we return true; else we return false. More formally, the details are given in Algorithm 3.

Algorithm 3 Testing For Statistical Significance: isStatSig(·)
1: Input: A collection of sets {S1, . . . , Sn}.
2: for k = 0, . . . , K do
3:     Set flag = true
4:     for S ∈ {S1, . . . , Sn} do
5:         if S is not statistically significant for metric k then
6:             Set flag = false
7:             Break inner loop
8:         end if
9:     end for
10:    if flag = true then
11:        return true
12:    end if
13: end for
14: return false
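Algorithm 3 translates directly into a few lines. Here is a Python sketch in which the significance test itself (e.g., a two-sample test at the chosen level) is left abstract as a callback:

```python
def is_stat_sig(sets, metrics, sig_test):
    """Return True iff there exists a metric k for which every set in
    `sets` shows a statistically significant treatment effect.
    `sig_test(s, k)` implements the per-set significance check."""
    for k in metrics:  # k = 0, ..., K
        if all(sig_test(s, k) for s in sets):
            return True
    return False
```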

9.2 Details on Algorithm 2

The proximal projection function Pxt(·) used in Algorithm 2 is defined as

Pxt(·) := argmin_{z ∈ X} { ⟨·, z⟩ + Bψt(z, xt) }    (7)

where Bψt(z, xt) is the Bregman divergence with respect to the 1-strongly convex function ψt, defined as

Bψt(x, y) = ψt(x) − ψt(y) − ⟨∇ψt(y), x − y⟩.    (8)

For more details, we refer the reader to [8]. In this paper, we have focused on two specific forms of the function ψt.

(1) Stochastic Gradient Descent: In this case, we use ψt(x) = xᵀx.

(2) Adagrad: Here we pick ψt(x) = xᵀHt x, where

Ht = δI + diag( Σ_{j=1}^{t} hj hjᵀ )^{1/2}

and the hj are the chosen gradients in steps 5 and 8 of Algorithm 2.

The comparison of these two choices has already been discussed in Section 6. Since we could not attach the real data due to privacy concerns, we are adding a simulated result, which the reader can reproduce using the code given below.

We consider a problem having a success metric and two guardrail metrics, and a treatment having 10 different levels. We generate a random distribution for each metric k and each treatment level j. We run the algorithm for N = 200 iterations and choose 50 samples to estimate the constraints at each step. We repeat both SGD and Adagrad 10 times and show the growth of the objective and the restriction of the constraint as the iterations increase in Figure 5.

Figure 5: Simulated Results (expectation of effect over 200 iterations for the objective and constraint, under both SGD and Adagrad).

Note that we are not claiming that one method is better than the other. However, in our simulations, we have usually observed that in simpler problems SGD tends to converge faster, while in tougher problems Adagrad tends to work better.

9.3 Code Details

We share example scripts for merging trees and running the stochastic optimization algorithms with simulation data at the following GitHub link: https://github.com/tuye0305/kdd2019prophet. The source code contains two main files:

(1) MergeTreeSimulation.R - This code helps us combine the different cohorts corresponding to each metric and each treatment into a single cohort definition.

(2) StochasticOpSimulation.R - This code runs the stochastic optimization routine to identify the optimal parameter in each cohort. Running the code as is should generate the plot shown in Figure 5.

We have also used the open-source R library Causal Tree (https://github.com/susanathey/causalTree) to identify the cohorts for each treatment j and metric k. Using these, the entire methodology discussed in this paper can be easily reproduced for any similar problem of interest.