
Five Challenging Problems for A/B/n Tests

Slides at http://bit.ly/DSS2015Kohavi 

(Follow-on talk to KDD 2015 keynote on Online Controlled Experiments: Lessons from Running A/B/n Tests for 12 years at http://bit.ly/KDD2015Kohavi)

Ron Kohavi, Distinguished Engineer, General Manager, Analysis and Experimentation, Microsoft

2

Australia: Learning from Daintree Forest


• Anyone know what this heart-shaped leaf, which Ted and I saw yesterday, is called? Why is it interesting?

• Stinging bush (Dendrocnide moroides)

• Silica hairs deliver a potent neurotoxin

• The sting can last months

• Wikipedia references articles about
  • Horses jumping off cliffs after being stung
  • An Australian officer who shot himself to escape the pain of a sting

• What you don’t know CAN kill you here

3

A/B/n Tests in One Slide

Concept is trivial
Randomly split traffic between two (or more) versions
o A (Control)
o B (Treatment)

Collect metrics of interest
Analyze

A/B test is the simplest controlled experiment

A/B/n refers to multiple treatments (often used and encouraged: try control + two or three treatments)

Must run statistical tests to confirm differences are not due to chance

Best scientific way to prove causality, i.e., that the changes in metrics are caused by the changes introduced in the treatment(s)

[Diagram: 100% of users randomly split 50%/50% between Control (existing system) and Treatment (existing system with Feature X); user interactions are instrumented, analyzed & compared at the end of the experiment]
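A minimal sketch of these mechanics, assuming hash-based assignment on a user ID and a per-user metric collected for each variant; the function names and use of SHA-256 are illustrative, not a description of Bing's actual system:

```python
import hashlib

import numpy as np
from scipy import stats


def assign_variant(user_id: str, experiment: str, n_variants: int = 2) -> int:
    """Deterministically assign a user to a variant by hashing user id + experiment name.
    0 = control (A); 1..n_variants-1 = treatments (B, C, ...)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_variants


def analyze(control_values, treatment_values, alpha=0.05):
    """Compare a per-user metric between control and treatment with a two-sample
    (Welch) t-test to confirm the difference is not due to chance."""
    control_values = np.asarray(control_values, dtype=float)
    treatment_values = np.asarray(treatment_values, dtype=float)
    delta = treatment_values.mean() - control_values.mean()
    _, p_value = stats.ttest_ind(treatment_values, control_values, equal_var=False)
    return {"delta": delta, "p_value": p_value, "stat_sig": p_value <= alpha}
```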

4

Challenge 1: Sessions/User as Metric

Search engines (Bing, Google) are evaluated on query share (distinct queries) and revenue as long-term goals

Observation: A ranking bug in an experiment resulted in very poor search results
 Degraded (algorithmic) search results cause users to search more to complete their task, and ads appear more relevant
 Distinct queries went up over 10%, and revenue went up over 30%

What metrics should we use as the OEC (Overall Evaluation Criterion) for a search engine?


5

OEC for Search Engines

Analyzing queries per month, we have

  Queries/Month = Queries/Session × Sessions/User × Users/Month

where a session begins with a query and ends with 30 minutes of inactivity. (Ideally, we would look at tasks, not sessions).

Key observation: we want users to find answers and complete tasks quickly, so queries/session should be smaller

In a controlled experiment, the variants get (approximately) the same number of users by design, so the last term is about equal

The OEC should therefore include the middle term: Sessions/User
This seems like an ideal metric for many sites, not just Bing: increased sessions/visits
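A sketch of how the Sessions/User term could be computed from a raw query log under the 30-minute inactivity rule above; the log format (user_id, timestamp) is an assumption for illustration:

```python
from collections import defaultdict
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)  # a session ends after 30 minutes of inactivity


def sessions_per_user(query_log):
    """query_log: iterable of (user_id, timestamp) pairs, one per query.
    A session begins with a query and ends after 30 minutes of inactivity."""
    query_times = defaultdict(list)
    for user_id, ts in query_log:
        query_times[user_id].append(ts)

    session_counts = []
    for times in query_times.values():
        times.sort()
        sessions = 1  # the first query always starts a session
        for prev, cur in zip(times, times[1:]):
            if cur - prev > SESSION_GAP:
                sessions += 1
        session_counts.append(sessions)

    # Sessions/User: total sessions divided by the number of distinct users
    return sum(session_counts) / len(session_counts)
```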


Challenge: Statistical Power

The t-statistic used to evaluate stat-sig is defined as t = Δ/σ_Δ, where Δ is the difference in the metric between the two variants, and σ_Δ is the (estimated) standard deviation of that difference

The relative confidence interval size is proportional to CV/√n, where
 CV is the coefficient of variation of the metric (standard deviation divided by the mean)
 n is the sample size
Normally, as the experiment runs longer and more users are admitted, the confidence interval should shrink

Here is a graph of the relative confidence interval size for Sessions/User over a month

It is NOT shrinking as expected
CV, normally fixed, is growing at the same rate as √n, so the ratio CV/√n stays roughly constant
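A sketch of the relative confidence-interval calculation above for a single variant's metric, assuming an approximately normal sampling distribution of the mean (1.96 is the two-sided 95% z-value):

```python
import numpy as np


def relative_ci_width(values, z=1.96):
    """Relative 95% confidence-interval half-width for the mean of a metric:
    z * (std / sqrt(n)) / mean  ==  z * CV / sqrt(n)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    cv = values.std(ddof=1) / values.mean()  # coefficient of variation
    return z * cv / np.sqrt(n)

# If CV were fixed, this would shrink like 1/sqrt(n) as more users are admitted;
# if CV grows like sqrt(n) (as observed for Sessions/User), it stops shrinking.
```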

7

Why is this Important?

Given that this metric is Bing's "north star," everyone tries to improve this metric
Degradations in Sessions/User (commonly due to serious bugs) are quickly stat-sig, indicating abandonment
Positive movements are extremely rare
 About two ideas a year are "certified" as having moved Sessions/User positively (out of 10K experiments, about 1,000-2,000 are successful on other OEC metrics), so 0.02% of the time
 Certification involves very low p-value (rare) and more commonly replication

Challenges
 Can we improve the sensitivity?
  We published a paper on using pre-experiment data (CUPED), which really helped here. Other ideas?
 Is there a similar metric that is more sensitive?
 Is it possible that this metric just can't be moved much?
  Unlikely. Comscore reports Sessions/User for Bing and Google and there's a factor-of-two gap


8

Challenge 2: NHST and P-values

NHST = Null Hypothesis Statistical Testing, the "standard" model commonly used
P-value <= 0.05 is the "standard" for rejecting the Null hypothesis
P-value is often mis-interpreted. Here are some incorrect statements from Steve Goodman's A Dirty Dozen:
1. If P = .05, the null hypothesis has only a 5% chance of being true
2. A non-significant difference (e.g., P > .05) means there is no difference between groups
3. P = .05 means that we have observed data that would occur only 5% of the time under the null hypothesis
4. P = .05 means that if you reject the null hypothesis, the probability of a type I error (false positive) is only 5%

The problem is that the p-value gives us Prob(X >= x | H_0), whereas what we want is Prob(H_0 | X >= x)


9

Why is this Important?

Take Sessions/User, a metric that historically moves positively 0.02% of the time at Bing
With standard p-value computations, 5% of experiments will show stat-sig movement, half of those positive
99.6% of the time, a stat-sig movement with p-value = 0.05 does not mean that the idea improved Sessions/User

Initial way to address this: Bayesian. See Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments at http://exp-platform.com for recent work by Alex Deng
Basically, we use historical data to set priors, but this assumes the new idea behaves like prior ones
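The 99.6% figure above falls out of Bayes' rule. A back-of-the-envelope sketch using the numbers on this slide; the statistical power value is an assumption for illustration, not a number from the talk:

```python
def prob_true_improvement(prior=0.0002, alpha=0.05, power=0.5):
    """P(the idea truly improved the metric | stat-sig positive result), via Bayes' rule.
    prior: historical rate of real positive movements (0.02% for Sessions/User at Bing)
    alpha: significance level; alpha/2 is the chance of a stat-sig *positive* result under the null
    power: assumed probability of detecting a real improvement (illustrative assumption)."""
    true_positives = prior * power
    false_positives = (1 - prior) * (alpha / 2)
    return true_positives / (true_positives + false_positives)


print(f"{1 - prob_true_improvement():.1%} of stat-sig positive movements are not real improvements")
# -> about 99.6% under these assumptions
```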


10

Challenge 3: Interesting Segments

When an A/B experiment completes, can we provide interesting insights by finding segments of users where the delta was much larger or smaller than the mean?
We should be able to apply machine learning methods
Initial ideas, such as Susan Athey's keynote, create high-variance labels
Issues with multiple hypothesis testing / overfitting
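One simple baseline for this challenge, assuming per-user metric values tagged with a segment attribute: compute the delta per segment and require a Bonferroni-corrected p-value, which addresses the multiple-testing issue but not the high-variance-label problem noted above (data layout and names are illustrative):

```python
from scipy import stats


def interesting_segments(control, treatment, alpha=0.05):
    """control, treatment: dicts mapping a segment name (e.g. browser, country)
    to the list of per-user metric values for that segment and variant.
    Returns segments whose delta is stat-sig after a Bonferroni correction."""
    segments = set(control) & set(treatment)
    threshold = alpha / len(segments)  # Bonferroni-corrected significance level
    results = []
    for seg in segments:
        c, t = control[seg], treatment[seg]
        delta = sum(t) / len(t) - sum(c) / len(c)
        _, p_value = stats.ttest_ind(t, c, equal_var=False)
        if p_value <= threshold:
            results.append((seg, delta, p_value))
    # Largest absolute deltas first: segments where the effect differs most from the mean
    return sorted(results, key=lambda r: abs(r[1]), reverse=True)
```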


11

Challenge 4: Duration / Novelty Effects

How long do experiments need to run?
 We normally run for one week
 When we suspect novelty (or primacy) effects, we run longer
 At Bing, despite running some experiments for 3 months, we rarely see significant changes
  Never saw stat-sig turn into negative stat-sig, for example

Google reported significant long-term impact of showing more ads
 For example, the KDD 2015 paper by Henning et al., Focusing on the Long-term: It's Good for Users and Business
 We ran the same experiment and have very different conclusions
o We saw Sessions/User decline. When that happens, most metrics are invalid, as users are abandoning
o Long-term experiments on cookie-based user identification have strong selection bias, as they require the same user to visit over a long period. Users erase cookies, lose cookies, etc.


12

Challenge 5: Leaks Due to Shared Resources

Shared resources are a problem for controlled experiments
Example: LRU caches are used often (least-recently-used elements are replaced)
 Caches must be experiment aware, as the cached elements often depend on experiments (e.g., search output depends on query term + backend experiments)
 The experiments a user is in are therefore part of the cache key

If control and treatment are of different size (e.g., control is 90%, treatment is 10%), then control has a big advantage because its elements are cached more often in an LRU cache

We usually run experiments with equal sizes because of this (e.g., 10% vs. 10%, even though a 90% control would reduce variance)
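A sketch of the experiment-aware cache key described above; the cache here is Python's built-in LRU, and all names (run_backend_search, the experiment labels) are hypothetical:

```python
from functools import lru_cache


def run_backend_search(query: str, assignments: tuple):
    """Hypothetical backend call; its output depends on the query AND the
    backend experiments the request is in."""
    ...


@lru_cache(maxsize=100_000)  # a plain LRU cache, as on the slide
def cached_search(query: str, assignments: tuple):
    # `assignments` (e.g. (("ranker_exp", "treatment"), ("ui_exp", "control")))
    # is part of the function arguments, hence part of the cache key, so control
    # and treatment never serve each other's cached results.
    return run_backend_search(query, assignments)


def search(query: str, experiment_assignments: dict):
    # Canonicalize the user's experiment assignments into a hashable key component
    return cached_search(query, tuple(sorted(experiment_assignments.items())))
```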


13

Challenge 5: Leaks Due to Shared Resources

Unsolved examples
 Treatment has a bug and uses disk I/O much more heavily. As the disk is a shared resource, this slows down control (a leak), and the delta in many metrics does not reflect the issue
 Real example: treatment causes the server to crash. As the system is distributed and reasonably resilient to crashes, requests are routed to other servers and the overall system survives, as long as crashes are not too common
  However, in this example the treatment crashed often, bringing Bing down after several hours
  You don't see it in the experiment scorecard, i.e., the metrics look similar

Solutions are not obvious
 Deploy experiments to a subset of machines to look for impact on the machines?
  This would work if there were a few experiments, but there are 300 concurrent experiments running
 Deploy to a single data center and then scale out. This is what we do today, but the crashes took several hours to impact the overall system, so the near-real-time monitoring did not catch this


14

Challenges not Raised

OEC is the Overall Evaluation Criterion
Desiderata:
o A single (weighted) metric that determines whether to ship a new feature (or not)
o Measurable using short-term metrics, but predictive of long-term goals (e.g., lifetime value)

What are the properties of good OECs?
o Hard example: support.microsoft.com as a web site. Is more time better or worse?
o Is there an OEC for Word? Excel? Or do these have to be feature specific?

Experiments in social-network settings (Facebook, LinkedIn)
 Papers are being published on these. I have very limited experience
 User identifiers changing (an unlogged-in user logs in)
 See Ya Xu's papers in the main conference and the Social Recommender Systems workshop: A/B testing challenges in social networks