A/B Testing (Hypothesis Testing)
CS57300 - Data Mining, Spring 2016
Instructor: Bruno Ribeiro
} Select 50% of users to see headline A: "Unlimited Clean Energy: Cold Fusion has Arrived"
} Select 50% of users to see headline B: "Wedding War"
} Do people click more on headline A or B?
A/B Testing
} Can you guess which page has a higher conversion rate and whether the difference is significant?
} When upgraded from A to B, the site lost 90% of its revenue
} Why? The shopper reasons: "There may be discount coupons out there that I do not have. The price may be too high. I should try to find these coupons." [Kumar et al. 2009]
A/B Testing on Websites
[Figure: screenshots of page variants A and B; Kumar et al. 2009]
Testing Hypotheses over Two Populations
[Figure: samples from Population 1 (Average 1) and Population 2 (Average 2), most values unknown]
Are the averages different? Which one has the largest average?
The two-sample t-test
Is the difference in averages between two groups larger than we would expect from chance alone?
PS: Same as the alien identification problem: we do not know how to model "the average is different", so we can only model the no-difference case.
t-Test (Independent Samples)
The goal is to evaluate whether the difference between the two population averages is zero.

In the t-test we make the following assumptions:
The averages x̄(1) and x̄(2) follow a normal distribution (we will see why)
Observations are independent

Two hypotheses:
H0: μ1 − μ2 = 0
H1: μ1 − μ2 ≠ 0
μi = average of population i
X(1) = random variables of population 1 values (a vector)
X(2) = random variables of population 2 values (a vector)
t-Test Calculation

General t formula:

$$ t = \frac{\text{sample statistic} - \text{hypothesized population difference}}{\text{estimated standard error}} $$

Independent-samples t, with empirical averages x̄(1), x̄(2) and empirical standard deviation SE (formula later):

$$ t = \frac{(\bar{x}^{(1)} - \bar{x}^{(2)}) - (\mu_1 - \mu_2)}{SE} $$
t-Statistics p-value

H0: μ1 − μ2 = 0
H1: μ1 − μ2 ≠ 0

} What is the p-value?
} Can we test H1?
} Can we ever directly accept hypothesis H1? No, we cannot test H1; we can only reject H0 in favor of H1.

$$ P\left[\bar{X}^{(1,n_1)} - \bar{X}^{(2,n_2)} > \bar{x}^{(1)} - \bar{x}^{(2)} \,\middle|\, H_0\right] = p $$

where x̄(i) is the empirical average of population i, and the X̄(i,n_i) are random variables (averages of n_i observations).

$$ P\left[\bar{X}^{(1,n_1)} - \bar{X}^{(2,n_2)} > \bar{x}^{(1)} - \bar{x}^{(2)} \,\middle|\, H_1\right] = 1 - p \;? $$
R code: see the sketch below.
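The original slide's R snippet is truncated; what follows is a minimal sketch, not from the slides, of how the test might be run in R, assuming x1 and x2 hold the observations of the two groups (the data are simulated here):

set.seed(1)
x1 <- rnorm(30, mean = 5.0, sd = 2)   # hypothetical sample from population 1
x2 <- rnorm(35, mean = 5.8, sd = 2)   # hypothetical sample from population 2

# Two-sided test of H0: mu1 - mu2 = 0 against H1: mu1 - mu2 != 0
t.test(x1, x2, alternative = "two.sided", mu = 0)

# One-sided alternatives (see the table below) use "less" or "greater"
t.test(x1, x2, alternative = "greater", mu = 0)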
Null hypothesis H0     Alternative hypothesis H1     No. of tails
μ1 − μ2 = d            μ1 − μ2 ≠ d                   2
μ1 − μ2 = d            μ1 − μ2 < d                   1
μ1 − μ2 = d            μ1 − μ2 > d                   1
Two Sample Tests (Fisher)
[Figure: two populations with mostly unknown values; compare Average 1 and Average 2]
Less Obvious Applications

} E.g. software updates: perform incremental A/B testing before rolling out ANY big system change on a website that should have no effect on users (even if users don't directly see the change)
What is the hypothesis we want to test? H0 = no difference in [engagement, purchases, delay, transaction time, ...]
How? Start with 0.1% of visitors (machines) and grow until 50% of visitors (machines); a sketch of this loop follows below
If at any time H0 is rejected, stop the roll-out
Must account for testing multiple hypotheses (next class)
(more precisely, this is sequential analysis)
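A rough sketch, not from the slides, of what such an incremental roll-out check might look like in R. The traffic fractions, the simulated engagement metric, and the 0.05 threshold are all illustrative assumptions:

set.seed(42)
fractions <- c(0.001, 0.01, 0.05, 0.10, 0.25, 0.50)   # assumed roll-out stages
total_visitors <- 1e6

for (f in fractions) {
  n <- round(f * total_visitors / 2)         # visitors per arm at this stage
  old_sys <- rnorm(n, mean = 100, sd = 15)   # simulated engagement metric, old system
  new_sys <- rnorm(n, mean = 100, sd = 15)   # under H0 the change has no effect
  p <- t.test(old_sys, new_sys)$p.value
  cat(sprintf("fraction %.3f: p = %.3f\n", f, p))
  if (p < 0.05) {                            # naive threshold; repeated looks need correction (next class)
    cat("H0 rejected -- stop the roll-out\n")
    break
  }
}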
Sequential Analysis (Sequential Hypothesis Test)

} How to stop an experiment early if the hypothesis seems true
Stopping criteria often need to be decided before the experiment starts
More next class
} Fisher's test: the test can only reject H0 (we never accept a hypothesis). H0 is likely wrong in real life, so rejection depends on the amount of data: the more data, the more likely we are to reject H0.
} Neyman-Pearson's test: compare H0 against an alternative H1. E.g., H0: μ = 0 and H1: μ = 1; compare the likelihood ratio P[Data | H0] / P[Data | H1] (a numerical example follows below).
} Bayesian test: compute the probability P[H0 | Data] and compare it against P[H1 | Data]. More precisely, P[H0 | Data] / P[H1 | Data] > 1 implies H0 is more likely.

Back to Fisher's test (no priors)
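A small numerical illustration, not from the slides, of the Neyman-Pearson and Bayesian comparisons for the toy example H0: μ = 0 vs. H1: μ = 1, assuming unit-variance normal data and equal priors:

set.seed(7)
obs <- rnorm(20, mean = 0.3, sd = 1)          # hypothetical observations

lik_H0 <- prod(dnorm(obs, mean = 0, sd = 1))  # P[Data | H0: mu = 0]
lik_H1 <- prod(dnorm(obs, mean = 1, sd = 1))  # P[Data | H1: mu = 1]

lik_H0 / lik_H1                               # Neyman-Pearson likelihood ratio

prior_H0 <- 0.5                               # equal priors (assumption)
prior_H1 <- 0.5
(lik_H0 * prior_H0) / (lik_H1 * prior_H1)     # Bayesian posterior odds: > 1 means H0 is more likely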
How to Compute Two-sample t-test (1)

1) Compute the empirical standard error

$$ SE = \sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$

where

$$ \bar{x}^{(i)} = \frac{1}{n_i}\sum_{m=1}^{n_i} x^{(i)}_m $$

and

$$ s_i^2 = \frac{1}{n_i - 1}\sum_{k=1}^{n_i}\left(x^{(i)}_k - \bar{x}^{(i)}\right)^2 $$

s_i² = sample variance of x(i), n_i = number of observations in x(i).

(assumes both populations have equal variance)
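A minimal sketch, not from the slides, of step 1 in R: computing the pooled empirical standard error, with x1 and x2 as hypothetical samples:

set.seed(1)
x1 <- rnorm(30, mean = 5.0, sd = 2)           # hypothetical sample 1
x2 <- rnorm(35, mean = 5.8, sd = 2)           # hypothetical sample 2
n1 <- length(x1); n2 <- length(x2)
s1_sq <- var(x1)                              # sample variance (1/(n-1) normalization)
s2_sq <- var(x2)

pooled_var <- ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
SE <- sqrt((1 / n1 + 1 / n2) * pooled_var)    # empirical standard error
SE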
How to Compute Two-sample t-test (2)

2) Compute the degrees of freedom

$$ DF = \left\lfloor \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\left(s_1^2/n_1\right)^2/(n_1 - 1) + \left(s_2^2/n_2\right)^2/(n_2 - 1)} \right\rfloor $$

3) Compute the test statistic (t-score, also known as Welch's t)

$$ t_d = \frac{\left(\bar{x}^{(1)} - \bar{x}^{(2)}\right) - d}{SE} $$

where d is the null-hypothesis difference.

4) Compute the p-value (depends on H1)
p = P[T_DF < −|t_d|] + P[T_DF > |t_d|]   (two-tailed test, H1: μ1 − μ2 ≠ d)
p = P[T_DF > t_d]   (one-tailed test, H1: μ1 − μ2 > d)

Important: H0 is always μ1 − μ2 = d, even when H1: μ1 − μ2 > d!
Testing H0: μ1 − μ2 ≤ d is harder and has the same power as H0: μ1 − μ2 = d.

What is the distribution of T_DF?
} I don't know (majority answer)
} I don't know (the true answer)
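A sketch, not from the slides, of steps 2-4 in R with hypothetical data. It uses Welch's (unpooled) standard error, matching R's default t.test(..., var.equal = FALSE), which differs slightly from the pooled SE on the previous slide:

set.seed(1)
x1 <- rnorm(30, mean = 5.0, sd = 2)
x2 <- rnorm(35, mean = 5.8, sd = 2)
n1 <- length(x1); n2 <- length(x2)
s1_sq <- var(x1); s2_sq <- var(x2)

# 2) Welch-Satterthwaite degrees of freedom
DF <- (s1_sq / n1 + s2_sq / n2)^2 /
      ((s1_sq / n1)^2 / (n1 - 1) + (s2_sq / n2)^2 / (n2 - 1))

# 3) Welch's t for H0: mu1 - mu2 = d (here d = 0), with the unpooled standard error
d  <- 0
SE <- sqrt(s1_sq / n1 + s2_sq / n2)
td <- (mean(x1) - mean(x2) - d) / SE

# 4) Two-tailed p-value: P[T_DF < -|td|] + P[T_DF > |td|]
p <- 2 * pt(-abs(td), df = DF)
c(t = td, df = DF, p = p)

t.test(x1, x2, var.equal = FALSE)   # built-in Welch test, for comparison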
Rejecting H0 in favor of H1

} Back to step 4 of the previous slide:

4) Compute the p-value (depends on H1)
p = P[T_DF < −|t_d|] + P[T_DF > |t_d|]   (two-tailed test, H1: μ1 − μ2 ≠ d)
p = P[T_DF > t_d]   (one-tailed test, H1: μ1 − μ2 > d)

Reject H0 with 95% confidence if p < 0.05.

[Figure: t-distribution with rejection regions; the two-tailed test places p = 0.025 in each tail, the one-tailed test places p = 0.05 in a single tail]
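Equivalently, a small sketch, not from the slides, of the same decision made by comparing the t statistic against the critical value of the t distribution (the DF and td values are hypothetical):

alpha <- 0.05
DF    <- 40                            # hypothetical degrees of freedom
td    <- 2.3                           # hypothetical t statistic

t_crit <- qt(1 - alpha / 2, df = DF)   # two-tailed critical value
abs(td) > t_crit                       # TRUE means reject H0 at the 95% confidence level
2 * pt(-abs(td), df = DF) < alpha      # same decision expressed through the p-value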
Some assumptions about X(1) and X(2)

$$ X^{(1)} = [X^{(1)}_1, X^{(1)}_2, \ldots, X^{(1)}_{n_1}], \qquad X^{(2)} = [X^{(2)}_1, X^{(2)}_2, \ldots, X^{(2)}_{n_2}] $$

} Observations of X(1) and X(2) are independent and identically distributed (i.i.d.)
} Central Limit Theorem (classical CLT): if E[X_k^(i)] = μi and Var[X_k^(i)] = σi² < ∞, then

$$ \sqrt{n_i}\left(\frac{1}{n_i}\sum_{k=1}^{n_i} X^{(i)}_k - \mu_i\right) \xrightarrow{d} N(0, \sigma_i^2) $$

(here the convergence is with respect to n_i → ∞)

} More generally, the real CLT is about stable distributions

CLT: If we have enough independent observations with small variance, we can approximate the distribution of their average with a normal distribution.
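A quick simulation sketch, not from the slides, of the CLT: scaled averages of skewed (Exponential) observations look increasingly normal as n_i grows. The choice of the Exponential(1) distribution is an illustrative assumption:

set.seed(3)
clt_demo <- function(n, reps = 10000) {
  # distribution of sqrt(n) * (sample mean - true mean) for Exponential(1) data
  means <- replicate(reps, mean(rexp(n, rate = 1)))
  sqrt(n) * (means - 1)
}

round(quantile(clt_demo(5),   c(0.025, 0.5, 0.975)), 2)   # small n: still skewed
round(quantile(clt_demo(100), c(0.025, 0.5, 0.975)), 2)   # larger n: close to normal
round(qnorm(c(0.025, 0.5, 0.975), mean = 0, sd = 1), 2)   # N(0, sigma^2 = 1) reference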
* But we don't know the variance of X(1) or X(2)

} The N(0, σi²) approximation is not too useful if we don't know σi²
} We can estimate σi² with the n_i observations of X(i)
} But we cannot just plug the estimate σ̂i² into the normal: it has some variability whenever n_i < ∞, and σ̂i² is Chi-Squared distributed. The t-distribution is a convolution of the standard normal with a Chi-Squared distribution:

$$ t = \frac{Z_i}{\sqrt{\hat{\sigma}_i^2 / DF}}, \qquad Z_i \sim N(0, \sigma_i^2) $$
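A simulation sketch, not from the slides, showing the standard construction of the t distribution: a standard normal divided by the square root of an independent, scaled Chi-Squared variable:

set.seed(4)
DF   <- 10
reps <- 100000
z    <- rnorm(reps)                          # standard normal draws
chi2 <- rchisq(reps, df = DF)                # independent Chi-Squared draws with DF degrees of freedom
t_sim <- z / sqrt(chi2 / DF)                 # normal divided by sqrt of a scaled Chi-Squared

round(quantile(t_sim, c(0.025, 0.975)), 2)   # simulated tail quantiles
round(qt(c(0.025, 0.975), df = DF), 2)       # t-distribution quantiles: should agree closely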
For small samples we can use the Binomial distribution

} If results are 0 or 1 (buy, do not buy), we can use Bernoulli random variables rather than the Normal approximation
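A sketch, not from the slides, of how small-sample 0/1 outcomes might be tested exactly in R; the counts and the hypothesized baseline rate are hypothetical:

conv <- c(8, 15)                        # hypothetical conversions on pages A and B
n    <- c(40, 45)                       # hypothetical visitors per page

binom.test(conv[1], n[1], p = 0.25)     # exact binomial test of one proportion against an assumed baseline
fisher.test(rbind(conv, n - conv))      # exact comparison of the two conversion rates (2x2 table)
prop.test(conv, n)                      # normal-approximation comparison, for contrast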
What about false positives and false negatives of a test?
Hypothesis Test Possible Outcomes

                            H0 is true                                H0 is false
Accept (do not reject) H0   P[H0 | H0]  (correct)                     P[H0 | ¬H0]  Type II error (false negative)
Reject H0                   P[¬H0 | H0]  Type I error (false positive)   P[¬H0 | ¬H0]  (correct)

Errors:
P[¬H0 | H0] - reject H0 given H0 is true (Type I, false positive)
P[H0 | ¬H0] - accept H0 given H0 is false (Type II, false negative)

In medicine our goal is to reject H0 (the drug/food has no effect, the patient is not sick), thus a positive result rejects H0.
Statistical Power

power = P[¬H0 | ¬H0] = P[reject H0 | H0 is false]

} Statistical power is the probability of rejecting H0 when H0 is indeed false
} Statistical Power ⇒ Number of Observations Needed
} Standard value is 0.80 but can go up to 0.95
} E.g.: H0 is μ1 − μ2 = 0, where μi = true average of population i. Define n = n1 = n2 such that the statistical power is 0.8 under the assumption |μ1 − μ2| = Δ:
P[Test Rejects | |μ1 − μ2| = Δ] = 0.8,
where Test Rejects = 1{p-value of (x̄(1), x̄(2)) under μ1 − μ2 = 0 is < 0.05}, which gives (see the sketch below)

$$ n = \frac{16\,\sigma^2}{\Delta^2} $$
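A sketch, not from the slides, checking the n = 16σ²/Δ² rule of thumb against R's power.t.test; the values of σ and Δ are illustrative assumptions:

sigma <- 2       # assumed common standard deviation
delta <- 0.5     # assumed true difference |mu1 - mu2|

16 * sigma^2 / delta^2                                        # rule-of-thumb n per group
power.t.test(delta = delta, sd = sigma, sig.level = 0.05,
             power = 0.80, type = "two.sample")$n             # exact n per group from R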
More Broadly: Hypothesis Testing Procedures

} Parametric: Z test, t-test, Cohen's d
} Nonparametric: Wilcoxon rank-sum test, Kruskal-Wallis H-test, Kolmogorov-Smirnov test
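A sketch, not from the slides, of the nonparametric alternatives in R, on hypothetical skewed data where the normality assumption is questionable:

set.seed(5)
x1 <- rexp(25, rate = 1.0)          # hypothetical skewed sample 1
x2 <- rexp(30, rate = 0.7)          # hypothetical skewed sample 2

wilcox.test(x1, x2)                 # Wilcoxon rank-sum (Mann-Whitney) test
ks.test(x1, x2)                     # Kolmogorov-Smirnov test of equal distributions
kruskal.test(list(x1, x2))          # Kruskal-Wallis H-test (here with two groups)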
Parametric Test Procedures
} Tests Population Parameters (e.g. Mean)
} Distribution Assumptions (e.g. Normal distribution)
} Examples: Z test, t-test, χ² test, F test
Testing Effect Sizes

General t formula:

$$ t = \frac{\text{sample statistic} - \text{hypothesized population difference}}{\text{estimated standard error}} $$

Independent-samples t, with empirical averages and estimated standard deviation:

$$ t = \frac{(\bar{x}^{(1)} - \bar{x}^{(2)}) - (\mu_1 - \mu_2)}{SE} $$

The t-test only tests whether the difference is zero or not. What about the size of the effect? Solution? Homework 2.
Effect Size: Good practice

Cohen's d is often used to complement the t-test when reporting effect sizes:

$$ d = \frac{\bar{x}^{(1)} - \bar{x}^{(2)}}{S} $$

where S is the pooled standard deviation:

$$ S = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$
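A sketch, not from the slides, computing Cohen's d in R with the pooled standard deviation; the data are hypothetical:

set.seed(1)
x1 <- rnorm(30, mean = 5.0, sd = 2)   # hypothetical sample 1
x2 <- rnorm(35, mean = 5.8, sd = 2)   # hypothetical sample 2
n1 <- length(x1); n2 <- length(x2)

S <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))  # pooled standard deviation
d <- (mean(x1) - mean(x2)) / S
d   # rough benchmarks (Cohen): ~0.2 small, ~0.5 medium, ~0.8 large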