A/B Testing (Hypothesis Testing)

CS57300 - Data Mining, Spring 2016
Purdue University
Instructor: Bruno Ribeiro

© 2016 Bruno Ribeiro

A/B Testing

- Select 50% of users to see headline A: "Unlimited Clean Energy: Cold Fusion has Arrived"
- Select 50% of users to see headline B: "Wedding War"
- Do people click more on headline A or B?

A/B Testing on Websites

- Can you guess which page has a higher conversion rate, and whether the difference is significant?
- When upgraded from page A to page B, the site lost 90% of its revenue.
- Why? "There may be discount coupons out there that I do not have. The price may be too high. I should try to find these coupons." [Kumar et al. 2009]

[Figure: screenshots of pages A and B, from Kumar et al. 2009]

Testing Hypotheses over Two Populations

[Figure: samples drawn from Population 1 and Population 2, with Average 1 and Average 2]

- Are the averages different?
- Which one has the largest average?

The Two-Sample t-Test

- Is the difference in averages between two groups more than we would expect based on chance alone?
- PS: Same as the alien identification problem: we don't know how to model the alternative, "the average is different".

t-Test (Independent Samples)

Let X^(1) be the random variable of population 1 values and X^(2) the random variable of population 2 values, with μ_i the average of population i; the samples are the vectors of observations of X^(1) and X^(2).

The goal is to evaluate whether the average difference between the two populations is zero. Two hypotheses:

  H0: μ1 - μ2 = 0
  H1: μ1 - μ2 ≠ 0

In the t-test we make the following assumptions:
- The averages of X^(1) and X^(2) follow a normal distribution (we will see why)
- Observations are independent

t-Test Calculation

General t formula:

  t = (sample statistic - hypothesized population difference) / estimated standard error

Independent samples t, with empirical averages x̄^(1), x̄^(2) and empirical standard deviation SE (formula later):

  t = ( (x̄^(1) - x̄^(2)) - (μ1 - μ2) ) / SE
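The calculation above is what standard libraries compute. As a quick sketch (not from the slides; the data below are made up), SciPy's `ttest_ind` returns exactly this t statistic and its p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical engagement scores for users shown headline A vs headline B
a = rng.normal(loc=0.5, scale=1.0, size=200)
b = rng.normal(loc=0.3, scale=1.0, size=200)

t_stat, p_value = stats.ttest_ind(a, b)  # default: equal-variance (pooled) t-test
print(t_stat, p_value)
```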

t-Statistics → p-Value

  H0: μ1 - μ2 = 0
  H1: μ1 - μ2 ≠ 0

- What is the p-value? It is the probability, under H0, of the random variables producing a difference at least as large as the observed one:

  P[ X̄^(1,n1) - X̄^(2,n2) > x̄^(1,n1) - x̄^(2,n2) | H0 ] = p

  where X̄^(i,ni) is the random variable for the average of n_i observations from population i, and x̄^(i,ni) is the empirical average of population i.
- Can we test H1? Is P[ X̄^(1,n1) - X̄^(2,n2) > x̄^(1,n1) - x̄^(2,n2) | H1 ] = 1 - p?
- Can we ever directly accept hypothesis H1? No, we can't test H1; we can only reject H0 in favor of H1.

R code (the snippet here is truncated in the source; a standard two-sample call in R would be `t.test(x1, x2, alternative = "two.sided")`)

  Null hypothesis H0   Alternative hypothesis H1   No. tails
  μ1 - μ2 = d          μ1 - μ2 ≠ d                 2
  μ1 - μ2 = d          μ1 - μ2 < d                 1
  μ1 - μ2 = d          μ1 - μ2 > d                 1
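The number of tails maps directly onto the `alternative` argument of SciPy's `ttest_ind` (a sketch with made-up data; "greater" corresponds to H1: μ1 - μ2 > 0):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x1 = rng.normal(0.4, 1.0, 150)
x2 = rng.normal(0.0, 1.0, 150)

# One test statistic, three possible alternative hypotheses
p_two = stats.ttest_ind(x1, x2, alternative="two-sided").pvalue   # H1: mu1 - mu2 != 0
p_greater = stats.ttest_ind(x1, x2, alternative="greater").pvalue  # H1: mu1 - mu2 > 0
p_less = stats.ttest_ind(x1, x2, alternative="less").pvalue        # H1: mu1 - mu2 < 0
print(p_two, p_greater, p_less)
```

The two one-tailed p-values are complementary tail areas of the same t statistic, and the two-tailed p-value doubles the smaller one.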

Two-Sample Tests (Fisher)

[Figure: samples from two populations, with Average 1 and Average 2]

Less Obvious Applications

- E.g., software updates: perform incremental A/B testing before rolling out ANY big system change on a website that should have no effect on users (even if users don't directly see the change).
- What is the hypothesis we want to test? H0 = no difference in [engagement, purchases, delay, transaction time, ...]
- How? Start with 0.1% of visitors (machines) and grow until 50% of visitors (machines). If at any time H0 is rejected, stop the rollout. Must account for testing multiple hypotheses (next class). (More precisely, this is sequential analysis.)

Sequential Analysis (Sequential Hypothesis Test)

- How to stop an experiment early if the hypothesis seems true?
- Stopping criteria often need to be decided before the experiment starts.
- More next class.
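A simulation sketch (not from the slides) of why the stopping rule matters: testing once at a pre-decided sample size keeps the false-positive rate near α, but "peeking" repeatedly and stopping at the first rejection inflates it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, n_max, alpha = 500, 200, 0.05
peek_points = range(20, n_max + 1, 20)  # "peek" at the data every 20 observations

fixed_rejections = 0    # test once, at the planned sample size
peeking_rejections = 0  # stop as soon as any interim test rejects

for _ in range(n_experiments):
    # H0 is true: both groups come from the same distribution
    x1 = rng.normal(0, 1, n_max)
    x2 = rng.normal(0, 1, n_max)
    if stats.ttest_ind(x1, x2).pvalue < alpha:
        fixed_rejections += 1
    if any(stats.ttest_ind(x1[:n], x2[:n]).pvalue < alpha for n in peek_points):
        peeking_rejections += 1

print(fixed_rejections / n_experiments, peeking_rejections / n_experiments)
```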

Back to Fisher's Test (No Priors)

- Fisher's test: the test can only reject H0 (we never accept a hypothesis). H0 is likely wrong in real life, so rejection depends on the amount of data: the more data, the more likely we are to reject H0.
- Neyman-Pearson's test: compare H0 to an alternative H1, e.g. H0: μ = 0 vs H1: μ = 1, through the likelihood ratio P[Data | H0] / P[Data | H1].
- Bayesian test: compute the probability P[H0 | Data] and compare it against P[H1 | Data]. More precisely, P[H0 | Data] / P[H1 | Data] > 1 implies H0 is more likely.
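A minimal sketch of the Neyman-Pearson comparison for the example above (H0: μ = 0 vs H1: μ = 1), assuming a known σ = 1 and made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.normal(1.0, 1.0, 50)  # data actually generated under H1 (mu = 1)

# Log-likelihood of the data under each simple hypothesis (sigma = 1 known)
loglik_h0 = stats.norm.logpdf(data, loc=0.0, scale=1.0).sum()
loglik_h1 = stats.norm.logpdf(data, loc=1.0, scale=1.0).sum()

# Likelihood ratio P[Data | H0] / P[Data | H1]
lr = np.exp(loglik_h0 - loglik_h1)
print(lr)  # expected to be far below 1, since the data were drawn under H1
```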

How to Compute Two-Sample t-Test (1)

1) Compute the empirical standard error:

  SE = √( ((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2) ) · √( 1/n1 + 1/n2 )

where the empirical average and the sample variance of x^(i) are

  x̄^(i) = (1/n_i) Σ_{m=1}^{n_i} x^(i)_m    and    s_i² = (1/(n_i - 1)) Σ_{k=1}^{n_i} ( x^(i)_k - x̄^(i) )²

and n_i is the number of observations in x^(i). (This assumes both populations have equal variance.)
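Step 1 as code (a sketch; `pooled_se` is a hypothetical helper name, and the data are made up):

```python
import numpy as np

def pooled_se(x1, x2):
    """Standard error of the mean difference, assuming equal population variances."""
    n1, n2 = len(x1), len(x2)
    s1_sq = np.var(x1, ddof=1)  # sample variance with 1/(n-1) normalization
    s2_sq = np.var(x2, ddof=1)
    pooled_var = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    return np.sqrt(pooled_var) * np.sqrt(1 / n1 + 1 / n2)

rng = np.random.default_rng(3)
x1, x2 = rng.normal(0, 2, 40), rng.normal(0, 2, 60)
print(pooled_se(x1, x2))
```

As a sanity check, feeding the same sample twice gives s·√(2/n), the familiar equal-n form.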

How to Compute Two-Sample t-Test (2)

2) Compute the degrees of freedom:

  DF = ⌊ (s1²/n1 + s2²/n2)² / ( (s1²/n1)²/(n1 - 1) + (s2²/n2)²/(n2 - 1) ) ⌋

3) Compute the test statistic (t-score, also known as Welch's t):

  t_d = ( (x̄1 - x̄2) - d ) / SE

where d is the null hypothesis difference.

4) Compute the p-value (depends on H1):

  p = P[T_DF < -|t_d|] + P[T_DF > |t_d|]   (two-tailed test, H1: μ1 - μ2 ≠ d)
  p = P[T_DF > t_d]                        (one-tailed test, H1: μ1 - μ2 > d)

Important: H0 is always μ1 - μ2 = d, even when H1: μ1 - μ2 > d!
Testing H0: μ1 - μ2 ≤ d is harder and has the same power as H0: μ1 - μ2 = d.

What is the distribution of T_DF?
- I don't know (majority answer)
- I don't know (the true answer)
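Steps 2-4 as a sketch with made-up data. Here the standard error is the Welch form √(s1²/n1 + s2²/n2), which is the one that pairs with the Welch degrees of freedom above (SciPy also skips the floor), so the result can be checked against `scipy.stats.ttest_ind` with `equal_var=False`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x1 = rng.normal(1.0, 1.5, 35)
x2 = rng.normal(0.2, 2.5, 50)
n1, n2 = len(x1), len(x2)
s1_sq, s2_sq = np.var(x1, ddof=1), np.var(x2, ddof=1)

# Welch standard error and degrees of freedom (no floor, as SciPy does)
se = np.sqrt(s1_sq / n1 + s2_sq / n2)
df = (s1_sq / n1 + s2_sq / n2) ** 2 / (
    (s1_sq / n1) ** 2 / (n1 - 1) + (s2_sq / n2) ** 2 / (n2 - 1)
)

d = 0.0  # null-hypothesis difference
t_d = ((x1.mean() - x2.mean()) - d) / se
p_two_tailed = 2 * stats.t.sf(abs(t_d), df)  # P[T < -|t|] + P[T > |t|]

t_scipy, p_scipy = stats.ttest_ind(x1, x2, equal_var=False)
print(t_d, p_two_tailed, t_scipy, p_scipy)
```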

Rejecting H0 in Favor of H1

Back to step 4 above:

4) Compute the p-value (depends on H1):

  p = P[T_DF < -|t_d|] + P[T_DF > |t_d|]   (two-tailed test, H1: μ1 - μ2 ≠ d)
  p = P[T_DF > t_d]                        (one-tailed test, H1: μ1 - μ2 > d)

Reject H0 with 95% confidence if p < 0.05.

[Figure: rejection regions of the t distribution; the two-tailed test puts p = 0.025 in each tail, the one-tailed test puts p = 0.05 in one tail]

Some Assumptions About X1 and X2

- Observations of X1 and X2 are independent and identically distributed (i.i.d.):

  X^(1) = [X^(1)_1, X^(1)_2, ..., X^(1)_{n1}],   X^(2) = [X^(2)_1, X^(2)_2, ..., X^(2)_{n2}]

- Central Limit Theorem (classical CLT): if E[X^(i)_k] = μ_i and Var[X^(i)_k] = σ_i² < ∞, then (here the limit is with respect to n_i)

  √n_i ( (1/n_i) Σ_{k=1}^{n_i} x^(i)_k - μ_i )  →_d  N(0, σ_i²)

- More generally, the real CLT is about stable distributions.

CLT: If we have enough independent observations with small variance we can approximate the distribution of their average with a normal distribution

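A quick simulation sketch of the CLT statement (not from the slides): averages of clearly non-normal (exponential) observations, standardized as above, look normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n, reps = 500, 2000

# Exponential observations with mean mu = 1 and variance sigma^2 = 1 (non-normal)
mu, sigma = 1.0, 1.0
samples = rng.exponential(scale=mu, size=(reps, n))

# Standardize each average as in the CLT: sqrt(n) * (mean - mu)
standardized = np.sqrt(n) * (samples.mean(axis=1) - mu)

# The result should be close to N(0, sigma^2)
ks_stat, ks_p = stats.kstest(standardized, "norm", args=(0, sigma))
print(ks_stat, ks_p)
```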

But We Don't Know the Variance of X^(1) or X^(2)

- The N(0, σ_i²) approximation is not too useful if we don't know σ_i².
- We can estimate σ_i² with the n_i observations of X^(i).
- But we cannot just plug the estimate of σ_i² into the normal: the estimate has some variability if n_i < ∞, and it is Chi-Squared distributed. The t-distribution combines the standard normal with a Chi-Squared distribution to account for estimating σ_i²:

  t = Z / √( χ²_DF / DF ),   with Z ~ N(0,1) independent of χ²_DF
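The construction can be checked by simulation (a sketch, not from the slides): variables built as Z/√(χ²_DF/DF) match Student's t distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
df, reps = 7, 5000

# Build t variates from the definition: Z / sqrt(V / df),
# with Z standard normal and V an independent Chi-Squared with df degrees of freedom
z = rng.standard_normal(reps)
v = rng.chisquare(df, reps)
t_samples = z / np.sqrt(v / df)

# They should be indistinguishable from Student's t with df degrees of freedom
ks_stat, ks_p = stats.kstest(t_samples, "t", args=(df,))
print(ks_stat, ks_p)
```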

For Small Samples We Can Use the Binomial Distribution

- If results are 0 or 1 (buy, not buy), we can use Bernoulli random variables rather than the Normal approximation.
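For 0/1 outcomes, one exact option (a sketch; this particular test is not named on the slide, and the counts are made up) is Fisher's exact test on the 2×2 conversion table:

```python
from scipy import stats

# Hypothetical conversion counts from an A/B test
conversions_a, trials_a = 30, 1000   # headline A
conversions_b, trials_b = 55, 1000   # headline B

table = [
    [conversions_a, trials_a - conversions_a],
    [conversions_b, trials_b - conversions_b],
]
odds_ratio, p_value = stats.fisher_exact(table, alternative="two-sided")
print(odds_ratio, p_value)
```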

What About False Positives and False Negatives of a Test?

Hypothesis Test Possible Outcomes

              H0 is true                      H0 is false
  Reject H0   Type I error (false positive)   Correct decision
  Accept H0   Correct decision                Type II error (false negative)

Errors:
- Type I error (false positive): P[reject H0 | H0 true], reject H0 given H0 is true.
- Type II error (false negative): P[accept H0 | H0 false], accept H0 given H0 is false.

In medicine our goal is to reject H0 (drug or food has no effect / patient is not sick), thus a positive result rejects H0.

Statistical Power

  power = P[reject H0 | H0 false]

- Statistical power is the probability of rejecting H0 when H0 is indeed false.
- Statistical power determines the number of observations needed. The standard value is 0.80, but it can go up to 0.95.
- E.g.: H0 is μ1 - μ2 = 0, where μ_i = true average of population i. Define n = n1 = n2 such that the statistical power is 0.8 under the assumption |μ1 - μ2| = Δ:

  P[Test rejects | |μ1 - μ2| = Δ] = 0.8

  where Test rejects = 1{ P[x̄^(1), x̄^(2) | μ1 - μ2 = 0] < 0.05 }, which (for populations with common variance σ²) gives

  n = 16 σ² / Δ²
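A simulation sketch (made-up parameters) checking the rule of thumb n = 16σ²/Δ²: with that per-group n, the fraction of rejections should land near 0.8:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(17)
sigma, delta, alpha = 1.0, 0.5, 0.05

# Rule-of-thumb sample size per group for 0.8 power
n = int(np.ceil(16 * sigma**2 / delta**2))  # 64 per group here

# Check the power by simulation: fraction of experiments that reject H0
reps = 2000
rejections = 0
for _ in range(reps):
    x1 = rng.normal(0.0, sigma, n)        # population 1, mean 0
    x2 = rng.normal(delta, sigma, n)      # population 2, mean delta (H0 is false)
    if stats.ttest_ind(x1, x2).pvalue < alpha:
        rejections += 1
print(n, rejections / reps)
```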

More Broadly: Hypothesis Testing Procedures

- Parametric: Z test, t test, Cohen's d
- Nonparametric: Wilcoxon Rank Sum test, Kruskal-Wallis H-test, Kolmogorov-Smirnov test

Parametric Test Procedures

- Test population parameters (e.g. the mean)
- Make distribution assumptions (e.g. Normal distribution)
- Examples: Z test, t-test, χ² test, F test


Testing Effect Sizes

General t formula:

  t = (sample statistic - hypothesized population difference) / estimated standard error

Independent samples t, with empirical averages and estimated standard deviation:

  t = ( (x̄^(1) - x̄^(2)) - (μ1 - μ2) ) / SE

- The t-test only tests whether the difference is zero or not. What if we care about the size of the difference?
- Solution? Homework 2.

Effect Size: Good Practice

Cohen's d is often used to complement the t-test when reporting effect sizes:

  d = ( x̄^(1) - x̄^(2) ) / S

where S is the pooled standard deviation:

  S = √( ((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2) )
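The formula above as code (a sketch; the helper name `cohens_d` is hypothetical and the data are made up):

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d: mean difference scaled by the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    s1_sq, s2_sq = np.var(x1, ddof=1), np.var(x2, ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / pooled_sd

rng = np.random.default_rng(19)
x1 = rng.normal(0.5, 1.0, 100)  # hypothetical group 1
x2 = rng.normal(0.0, 1.0, 100)  # hypothetical group 2
print(cohens_d(x1, x2))
```

Note that d is unitless: rescaling both samples by the same factor leaves it unchanged, which is what makes it comparable across studies.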