Click here to load reader

A/B Testing (Hypothesis Testing) - Purdue University · PDF fileA/B Testing (Hypothesis Testing) CS57300 - Data Mining Spring 2016 Instructor: Bruno Ribeiro ... “There maybe discount

  • View
    212

  • Download
    0

Embed Size (px)

Text of A/B Testing (Hypothesis Testing) - Purdue University · PDF fileA/B Testing (Hypothesis...

  • A/B Testing(Hypothesis Testing)

    CS57300 - Data MiningSpring 2016

    Instructor: Bruno Ribeiro

    2016 Bruno Ribeiro

  • } Select 50% users to see headline A Unlimited Clean Energy: Cold

    Fusion has Arrived

    } Select 50% users to see headline B Wedding War

    } Do people click more on headline A or B?

    A/B Testing

    2 2016 Bruno Ribeiro

  • } Can you guess which page has a higher conversion rate and whether the difference is significant?

    } When upgraded from the A to B the site lost 90% of their revenue} Why? There maybe discount coupons out there that I do not have. The

    price may be too high. I should try to find these coupons. [Kumar et al. 2009]

    3

    A/B Testing on Websites

    A B Kumar et al. 2009

    2016 Bruno Ribeiro

  • Testing Hypotheses over Two Populations

    12

    ?

    ?

    1049

    ?

    ?

    34

    ?

    Are the averages different?Which one has the largest average?

    ?

    ?

    4

    Average 1 Average 2

    Population 2Population 1

    2016 Bruno Ribeiro

  • The two-sample t-test

    Is difference in averages between two groups more than we would expect based on chance alone?

    PS: Same as alien identification problem: we dont know how to model average is different

    5 2016 Bruno Ribeiro

  • t-Test (Independent Samples)

    H0: 1 - 2 = 0

    H1: 1 - 2 0

    The goal is to evaluate if the average difference between two populations is zero

    In the t-test we make the following assumptions

    The averages and follow a normal distribution (we will see why)

    Observations are independent

    Two hypotheses:

    6

    population 1 average

    vectors

    2016 Bruno Ribeiro

    X(1) = random variable of population 1 values

    X(2) = random variable of population 2 values

    X(1)

    X(2)

  • General t formula

    t = sample statistic - hypothesized population differenceestimated standard error

    Independent samples tEmpirical averages

    Empirical standard deviation (formula later)

    t-Test Calculation

    7 2016 Bruno Ribeiro

    t =(x(1) x(2)) (1 2)

    SE

  • } What is the p-value?

    } Can we test H1?

    } Can we ever directly accept hypothesis H1 ? No, we cant test H1, we can only reject H0 in favor of H1

    8

    t-Statistics p-valueH0: 1 - 2 = 0

    H1: 1 - 2 0

    2016 Bruno Ribeiro

    P [X(1,n1) X(2,n2) > x(1,n1) x(2)|H0] = p

    x

    (i)= empirical average of population i

    P [X(1,n1) X(2,n2) > x(1,n1) x(2)|H1] = 1 p?

    Random variables

  • 9

    R code

    x1

  • Null hypothesis H0 Alternative hypothesis H1 No. Tails1 - 2 = d 1 - 2 d 21 - 2 = d 1 - 2 < d 11 - 2 = d 1 - 2 > d 1

    10

    Two Sample Tests (Fisher)

    12

    ?

    ?

    109

    ?

    ?

    4

    ??

    ?

    Average 1 Average 2

    2016 Bruno Ribeiro

  • } E.g. software updates Perform incremental A/B testing before rolling ANY big system

    change on a website that should have no effect on users(even if users dont directly see the change) What is the hypothesis we want to test? H0 = no difference in [engagement, purchases, delay, transaction

    time,]

    How? Start with 0.1% of visitors (machines) and

    grow until 50% of visitors (machines) If at any time H0 is rejected, stop the roll out Must account for testing multiple hypotheses (next class)

    (more precisely, this is sequential analysis)

    11

    Less Obvious Applications

    2016 Bruno Ribeiro

  • } How to stop experiment early if hypothesis seems true Stopping criteria often needs to be decided before experiment starts More next class

    12

    Sequential Analysis (Sequential Hypothesis Test)

    2016 Bruno Ribeiro

  • } Fishers test Test can only reject H0 (we never accept a hypothesis) H0 is likely wrong in real-life, so rejection depends on the amount of data More data, more likely we will reject H0

    } Neyman-Pearsons test Compare H0 to alternative H1 E.g.: H0: = 0 and H1: = 1 P[Data | H0 ] / P[Data | H1]

    } Bayesian test Compute probability P[H0 | Data] and compare against P[H1 | Data] More precisely, test P[H0 | Data] / P[H1 | Data] >1 implies H0 is more likely

  • 14

    Back to Fishers test(no priors)

    2016 Bruno Ribeiro

  • 1) Compute the empirical standard error

    where,

    and

    (assumes both populations have equal variance)15

    How to Compute Two-sample t-test (1)

    Sample variance of x(i)

    Number of observations in x(i)

    s

    2i =

    1

    ni

    niX

    k=1

    (x(i)k x(i))2

    2016 Bruno Ribeiro

    xi =1

    ni

    niX

    m=1

    x

    (i)m

    SE =

    s1

    n1+

    1

    n2

    (n1 1)s21 + (n2 1)s22

    n1 + n2 2

  • 2) Compute the degrees of freedom

    3) Compute test statistic (t-score, also known as Welshs t)

    where d is the Null hypothesis difference.4) Compute p-value (depends on H1) p = P[TDF < -|td|] + P[TDF > |td|] (Two-Tailed Test H1: 1 - 2 d)

    p = P[TDF > td] (One-Tailed Test for H1 : 1 - 2 > d) Important: H0 is always 1 - 2 = d even when H1 : 1 - 2 > d !!

    Testing H0: 1 - 2 d is harder and has same power as H0:1 - 2 = d

    What is the distribution of TDF?} I dont know (majority answer)} I dont know (the true answer) 16

    How to Compute Two-sample t-test (2)

    DF =

    $ 21/n1 +

    22/n2

    2

    (21/n1)2/(n1 1) + (22/n2)2/(n2 1)

    %

    td =(x1 x2) d

    SE

    2016 Bruno Ribeiro

  • } Back to step 4 of slide 16:

    17

    Rejecting H0 in favor of H1

    4) Compute p-value (depends on H1)

    p = P[TDF < -|td|] + P[TDF > |td|] (Two-Tailed Test H1: 1 - 2 d)

    p = P[TDF > td] (One-Tailed Test for H1 : 1 - 2 > d)

    Reject H0 with 95% confidence if p < 0.05

    p=0.025 p=0.025 p=0.05

    2016 Bruno Ribeiro

  • }

    }

    } Observations of X1 and X2 are independent and identically distributed (i.i.d.)

    } Central Limit Theorem (Classical CLT) If: E[Xk(i)] = i and Var[Xk(i)] = i2 <

    } More generally, the real CLT is about stable distributions18

    Some assumptions about X1 and X2

    (here is with respect to ni)

    pni

    1

    ni

    nX

    k=1

    x

    (i)k

    ! i

    !d! N(0,2i )

    2016 Bruno Ribeiro

    X(2) = [X(2)1 ,X(2)2 , . . . ,X

    (2)n2 ]

    X(1) = [X(1)1 ,X(1)2 , . . . ,X

    (1)n1 ]

  • 19

    CLT: If we have enough independent observations with small variance we can approximate the distribution of their average with a normal distribution

    2016 Bruno Ribeiro

  • } approximation not too useful if we dont know

    } We can estimate with ni observations of

    } But we cannot just plug-in estimate on the normal It has some variability if ni < is Chi-Squared distributed The t-distribution is a convolution of the standard normal

    with a Chi-Square distribution to compute

    20

    * But we dont know the variance of X(1) or X(2)

    N(0,2i ) 2i

    2i N(0,2i )

    2i

    2i

    t =ip

    2i /DF

    2016 Bruno Ribeiro

  • } If results are 0 or 1 (buy, not buy) we can use Bernoulli random variables rather than the Normal approximation

    21

    For small samples we can use the Binomial distribution

    2016 Bruno Ribeiro

  • 22

    What about false positives and

    false negatives of a test?

    2016 Bruno Ribeiro

  • 23

    Hypothesis Test Possible Outcomes

    Type I error (false positive)

    P [H0|H0] P [H0|H0]

    P [H0|H0]Type II error (false negative)

    P [H0|H0]

    P [H0|H0] - Reject H0 given H0 is trueP [H0|H0] - Accept H0 given H0 is false

    Errors:

    In medicine our goal is to reject H0(drug, food has no effect / not sick), thus a positive result rejects H0

    2016 Bruno Ribeiro

  • } Statistical power is probability of rejecting H0when H0 is indeed false

    } Statistical Power Number of Observations Needed} Standard value is 0.80 but can go up to 0.95

    } E.g.: H0 is 1 - 2 = 0 , where i = true average of population i Define n = n1 = n2 such that statistical power is 0.8 under

    assumption |1 - 2| = : P[Test Rejects | |1 - 2| = ] = 0.8

    where Test Rejects = 1{P[x(1) , x(2) | 1 - 2 = 0] < 0.05}which gives

    24

    Statistical Powerpower = P [H0|H0]

    n =162

    2 2016 Bruno Ribeiro

  • More Broadly: Hypothesis Testing Procedures

    Hypothesis Testing

    Procedures

    Parametric

    Z Test t Test Cohen's d

    Nonparametric

    Wilcoxon Rank Sum

    Test

    Kruskal-WalliH-Test

    Kolmogorov-Smirnov test

    25 2016 Bruno Ribeiro

  • Parametric Test Procedures

    } Tests Population Parameters (e.g. Mean)

    } Distribution Assumptions (e.g. Normal distribution)

    } Examples: Z Test, t-Test, c2 Test, F test

    26 2016 Bruno Ribeiro

  • 27 2016 Bruno Ribeiro

  • General t formula

    t = sample statistic - hypothesized population differenceestimated standard error

    Independent samples tEmpirical averages

    Estimated standard deviation

    Testing Effect Sizes

    28

    t =(x(1) x(2)) (1 2)

    SE

    t-Test tests only if the difference is zero or not?

    Solution? Homework 2 2016 Bruno Ribeiro

  • Cohens d often used to complement t-test when reporting effect sizes

    where S is the pooled variance

    Effect Size: Good practice

    29

    d =x

    (1) x(2)

    S

    S =

    s(n1 1)s21 + (n2 1)s22

    n1