
Resampling Methods

Uploaded by tracey


Page 1: Resampling  Methods

From Wikipedia: “Parametric statistics is a branch of statistics that assumes (that) data come from a type of probability distribution and makes inferences about the parameters of the distribution.

Most well-known elementary statistical methods (e.g. the ones from our class) are parametric.”

But there are alternative methods that don’t require any assumptions about the shape of the population’s probability distribution. Resampling methods are an example.


Page 2: Resampling  Methods

There are three kinds of resampling methods:

Permutation methods – used most commonly with correlations, where the probability of the observed data under the null hypothesis is estimated by comparing the observed pairings to a large number of random pairings of the data.

Monte Carlo methods – estimate the population probability distribution through simulation.

Bootstrap methods – the sampling distribution of an observed statistic is estimated by repeatedly resampling the data with replacement and recalculating the statistic.

Page 3: Resampling  Methods

Example of a permutation method: Suppose you measured the IQs of 25 pairs of twins and found a correlation of r = 0.36. The scatter plot of your data is shown below. Is the observed correlation significantly greater than zero? (use α = .05)

[Figure: scatter plot of IQ Twin 2 against IQ Twin 1; correlation r = 0.36]

The (parametric) test used in our class would have found an r_crit value of 0.330. We would reject H0 and conclude that a correlation of 0.36 is (barely) significantly greater than zero.

Page 4: Resampling  Methods

The distribution under the null hypothesis can be estimated by repeatedly shuffling (or ‘permuting’) the relationship between the X and Y values and calculating the correlation:

Observed pairs (X, Y) and three example shuffles of Y (Y'):

  X     Y     Y'    Y'    Y'
 97    89    59    79    64
 89    87    91    85    72
 81    91    45    88   105
 85    87    72    84    73
 70    59    43    77    91
105    88    84    84    97
 81    97    74   105    84
107    94    77    72    92
 84    45   105    43    77
 93    77    64    77    74
 58    73    87    73    77
 69    84    73    97    59
 99    74    79    89    85
 70    79    89    92   105
 89   105    84    45    84
 75    92    77    64    43
 78    84    87    91    74
 60    64    84    94    84
 61    77    94    87    45
 95    84    92    74    87
 79    72    97    74    94
 68    74    88    84    89
 69    85    85    87    87
 93   105    74    59    79
 79    43   105   105    88

r =   .36  -.12  -.26   .20

Page 5: Resampling  Methods

Example permuted correlations (r):

 0.05  -0.01  -0.00  -0.22   0.25   0.05  -0.32
-0.47  -0.34  -0.18  -0.01   0.31  -0.20  -0.25
 0.15  -0.37  -0.11  -0.24  -0.38  -0.36  -0.26
-0.30  -0.09  -0.24   0.07   0.05   0.13  -0.05
-0.16   0.02  -0.17   0.11  -0.12  -0.19  -0.01

This generates a distribution of correlations that should be centered around zero.

We can then use this distribution to calculate the probability of obtaining a correlation at least as large as our observed sample correlation if the null hypothesis is true.

Page 6: Resampling  Methods

[Figure: histogram of permuted correlations (r), x-axis from -0.6 to 0.6]

After 100,000 reps, Pr(r > 0.36) = 0.0378

Only 3.78% of the correlations generated by permutation exceed the observed correlation of 0.36, so we’d reject the null hypothesis using α = .05.
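The shuffling procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to produce the slides: the function names, the seed, and the choice of 10,000 repetitions (instead of the 100,000 reported above) are assumptions made to keep the example quick. The data are the 25 observed (X, Y) pairs listed on the slide.

```python
import random

# Observed twin IQ pairs from the slide (x = twin 1, y = twin 2).
x = [97, 89, 81, 85, 70, 105, 81, 107, 84, 93, 58, 69, 99,
     70, 89, 75, 78, 60, 61, 95, 79, 68, 69, 93, 79]
y = [89, 87, 91, 87, 59, 88, 97, 94, 45, 77, 73, 84, 74,
     79, 105, 92, 84, 64, 77, 84, 72, 74, 85, 105, 43]


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    s_aa = sum((ai - mean_a) ** 2 for ai in a)
    s_bb = sum((bi - mean_b) ** 2 for bi in b)
    return s_ab / (s_aa * s_bb) ** 0.5


def permutation_p_value(a, b, n_reps=10_000, seed=0):
    """One-tailed p-value: share of shuffles with r at least the observed r.

    n_reps and seed are illustrative choices, not values from the slides.
    """
    rng = random.Random(seed)
    r_obs = pearson_r(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_reps):
        rng.shuffle(shuffled)  # break the X-Y pairing at random
        if pearson_r(a, shuffled) >= r_obs:
            hits += 1
    return r_obs, hits / n_reps


r_obs, p_value = permutation_p_value(x, y)
print(f"observed r = {r_obs:.2f}, one-tailed p = {p_value:.4f}")
```

With fewer repetitions the estimated probability is noisier than the slide's figure, but it should land in the same neighborhood.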

Page 7: Resampling  Methods

Example of a Monte Carlo simulation: Liar’s dice

This is a game where n players together roll 40 six-sided dice and keep the outcome hidden under their own separate cups. The goal is to guess how many dice equal the mode. After a player makes a guess, the next player must decide if the guess is too high, or otherwise guess a higher number. If it is decided that the guess is too high, the cups are lifted and the number of dice equal to the mode is computed. If the guess was too high, the challenger wins and the player that made the guess must drink (lemonade).

Suppose there are eight players, each with 5 dice. The player to your right just guessed that 14 dice equal the mode. What is the probability that the number of dice equal to the mode among the 40 dice is that high or higher?

Here’s an example of 40 throws. The mode is 5, and 10 of these throws equal the mode.

mode   #
   3  13

Page 8: Resampling  Methods

Example of 20 simulations. Each row is a throw of 40 dice. The last column is the number of dice that equal the mode.

rep #  mode   #
    1     5  12
    2     1   8
    3     1   8
    4     2   9
    5     2  11
    6     3   9
    7     2  10
    8     3  12
    9     3   8
   10     3   9
   11     2  10
   12     4  12
   13     6  13
   14     2   9
   15     2  11
   16     1  11
   17     6  10
   18     2  12
   19     3  10
   20     2   9

Page 9: Resampling  Methods

A computer simulation of one million rolls generated this histogram. Shown in red are the cases where the number of dice equal to the mode is 14 or higher.

Only 2.31% of the simulations found a count of 14 or higher. This small number means that the guess is probably too high, so the player should ask all players to lift their cups and calculate the value.

[Figure: histogram; x-axis: Mode of 40 dice (10 to 20), y-axis: Percent of rolls (0 to 30)]
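A simulation along these lines can be sketched as follows. It is only an illustrative version: the function names, the seed, and the use of 20,000 simulated throws (rather than one million) are assumptions chosen so the example runs quickly.

```python
import random
from collections import Counter


def modal_count(n_dice, rng):
    """Roll n_dice six-sided dice and return how many show the modal face."""
    rolls = rng.choices(range(1, 7), k=n_dice)
    return max(Counter(rolls).values())


def prob_modal_count_at_least(threshold, n_dice=40, n_sims=20_000, seed=0):
    """Monte Carlo estimate of Pr(modal count >= threshold).

    n_sims and seed are illustrative choices, not values from the slides.
    """
    rng = random.Random(seed)
    hits = sum(modal_count(n_dice, rng) >= threshold for _ in range(n_sims))
    return hits / n_sims


p_high = prob_modal_count_at_least(14)
print(f"Estimated Pr(count >= 14) = {p_high:.4f}")
```

Raising `n_sims` tightens the estimate at the cost of run time; the estimate should sit close to the percentage quoted above.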

Page 10: Resampling  Methods

Third method of resampling: bootstrapping to conduct a hypothesis test on medians.

Suppose you measured the amount of time it takes for a subject to perform a simple mental rotation. Previous research shows that it should take a median of 2 seconds to complete this task. Your subject completes 500 trials and generates the distribution of response times below, which has a median of 2.15 seconds. Is this number significantly greater than 2? (use α = .05)

[Figure: histogram of response times; x-axis: Response Time (sec), 0 to 25; median = 2.15 sec]

Page 11: Resampling  Methods

The trick to bootstrapping is to generate an estimate of the sampling distribution of your observed statistic by repeatedly sampling the data with replacement and recalculating the statistic.

[Figure: histograms of 12 bootstrap resamples (x-axis 0 to 25 sec), with medians 2.22, 2.23, 2.22, 2.10, 2.21, 2.20, 2.19, 1.98, 2.08, 2.21, 2.07, 2.22]

For our example, we can count the proportion of times that the median falls below 2.

Page 12: Resampling  Methods

[Figure: histogram of bootstrapped medians, x-axis from 1.6 to 2.6]

After 1,000,000 reps, Pr(median < 2.00) = 0.0620

Since more than 5% of our bootstrapped medians fall below 2, we (just barely) cannot conclude that our observed median is significantly greater than 2.
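The bootstrap procedure can be sketched as below. The slides do not list the 500 response times, so this example draws hypothetical right-skewed stand-in data from a lognormal distribution; the distribution parameters, the seeds, the function name, and the 2,000 repetitions are all assumptions. Its output will therefore not reproduce the 0.0620 reported above.

```python
import random
import statistics


def bootstrap_medians(data, n_reps=2_000, seed=0):
    """Resample data with replacement n_reps times and return each median.

    n_reps and seed are illustrative choices, not values from the slides.
    """
    rng = random.Random(seed)
    n = len(data)
    return [statistics.median(rng.choices(data, k=n)) for _ in range(n_reps)]


# HYPOTHETICAL data: 500 skewed response times (sec) standing in for the
# subject's trials, which are not available in the slides.
data_rng = random.Random(1)
response_times = [data_rng.lognormvariate(0.77, 0.6) for _ in range(500)]

medians = bootstrap_medians(response_times)

# Proportion of bootstrapped medians that fall below the hypothesized value.
p_below = sum(m < 2.0 for m in medians) / len(medians)
print(f"sample median = {statistics.median(response_times):.2f}")
print(f"Pr(bootstrapped median < 2.00) = {p_below:.4f}")
```

If `p_below` is larger than α, the observed median is not significantly greater than 2, which is the same decision rule applied above.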