Sec 4.6 - Variables
Variable: takes on different values (or attributes)
Random variable: cannot be predicted with certainty
Recall: We want to know the probability of observing a particular sample
Random Variables
• Qualitative: eg. political affiliation, color preference, gender
• Quantitative: measurable, numeric outcomes
– Discrete: eg. # heads tossed, enrollment
– Continuous: eg. age of marriage, income tax return amts, height
4.7 – Probability Distributions for Discrete RVs
Discrete random variable: a quantitative random variable that can
only assume a countable number of values
Probability Distribution of y: theoretical relative frequencies obtained from
the probabilities for each value of y
The probability distribution for a discrete r.v. y displays the probability P(y)
associated with each value of y.
What is the probability associated with
each value of the variable, y?
Probability Distributions – Discrete RVs
Example. Consider the tossing of 2 coins, and define the variable, y, to be the number
of heads observed. Possible values of y: 0, 1, 2.
Suppose that empirical sampling yields the following:
y freq rel. freq
0 129 0.258
1 242 0.484
2 129 0.258
Empirical probability distribution of y: the rel. freq column above.
Theoretical probability distribution of y:
y P(y)
0 0.25
1 0.5
2 0.25
[Figure: theoretical and empirical probability distributions of y]
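The comparison between the theoretical and empirical distributions can be sketched in code. This is only an illustration: the coins are assumed fair, the theoretical values come from listing the four equally likely outcomes, and the empirical values come from a simulation whose seed and sample size (500, matching the total of the freq column) are arbitrary choices, so its frequencies will not reproduce the table above exactly.

```python
import random
from collections import Counter
from itertools import product

# Theoretical distribution: enumerate the 4 equally likely outcomes of 2 fair coins.
outcomes = list(product("HT", repeat=2))
head_counts = Counter(outcome.count("H") for outcome in outcomes)
theoretical = {y: head_counts[y] / len(outcomes) for y in (0, 1, 2)}

# Empirical distribution: simulate tossing 2 coins n_trials times (seed is arbitrary).
random.seed(0)
n_trials = 500
counts = Counter(
    sum(random.random() < 0.5 for _ in range(2)) for _ in range(n_trials)
)
empirical = {y: counts[y] / n_trials for y in (0, 1, 2)}

for y in (0, 1, 2):
    print(y, theoretical[y], round(empirical[y], 3))
```

With more simulated tosses the empirical column should settle closer to the theoretical one.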
4.9 – Probability Distributions for Continuous RVs
Continuous Random Variable: quantitative, variable assumes values on an
interval, uncountably many possible values
Example. Consider the random variable, y, that is the height (in feet) of an 18
year old male in the US. The following is sample data collected from 400
individuals:
5.4959 5.507 5.5259 5.5698 5.5446 5.4464 5.884 5.837 5.4901 5.4569 5.182 5.6931 4.923 5.9798 5.0576 6.4278 6.1558 6.6181 6.0048 6.1135
5.1775 6.1842 6.2378 6.0983 6.0165 6.1591 5.4195 5.5411 5.7411 5.6197 5.3421 5.8045 5.665 6.0323 5.8679 5.9166 6.0485 5.1919 5.8154 5.0156
5.5252 5.7821 5.355 5.6197 5.3412 6.1074 5.6618 5.8685 5.8428 5.4685 5.758 5.6823 5.7863 5.4616 5.7128 5.8254 5.8888 5.6631 6.4617 5.8419
5.5149 5.2276 5.4401 6.2809 5.8324 6.0809 4.9667 5.9412 6.2718 5.5195 5.5634 5.1731 6.2311 5.7405 5.7851 5.5214 6.027 5.0959 5.5863 5.55
5.8677 5.3949 5.8159 5.3006 5.7134 5.6737 6.0842 5.6256 6.316 6.0855 6.1686 5.2436 5.4665 6.5448 5.9669 5.7581 5.8026 6.0079 5.3411 5.9654
6.0338 6.0263 5.0646 6.3141 6.059 5.6471 5.764 6.3245 5.3717 5.2129 5.9169 5.9424 5.4851 5.4272 5.6306 5.7162 5.7367 5.7482 6.266 5.1307
5.7611 5.1296 5.8472 5.7218 5.9569 5.4853 5.0979 5.8701 5.6827 5.6347 5.1582 5.8158 5.1913 5.8076 4.9118 5.8247 5.6585 5.4951 5.8124 5.6896
6.0666 5.5501 5.5753 6.0568 5.0824 5.9461 6.0266 5.1727 4.9793 5.6128 5.4857 6.2163 5.6608 6.1057 5.2619 5.5251 5.7406 5.7582 5.4758 5.4328
5.4425 6.0701 5.4692 5.8255 5.5485 6.0436 5.2806 6.2656 6.0661 5.7243 5.8049 6.1042 5.6512 5.635 5.7107 5.1302 5.95 6.1118 5.2903 5.3639
6.0563 5.5281 5.443 6.2666 5.6261 5.6967 5.8427 5.4449 5.5194 5.6584 6.1407 5.941 6.1833 4.8951 5.785 5.5433 5.2857 5.9222 6.0596 5.9524
6.0389 5.8492 5.531 6.1674 5.8486 5.8822 5.6159 5.6625 6.0852 5.4245 5.7264 4.9846 5.2148 6.4544 5.8351 6.3308 6.2109 5.6398 5.6678 5.5356
5.8694 5.6393 5.5884 6.0101 6.021 6.2048 5.7914 5.8772 6.1343 5.7689 5.7496 5.9386 5.5588 5.8282 6.0254 6.1923 5.4785 5.8039 5.7008 6.4147
5.8676 6.0046 5.7402 5.7745 5.8013 6.1333 4.8571 4.9746 5.9478 5.7179 5.79 6.1722 5.8119 5.799 5.7891 5.6666 6.1177 5.9385 5.5016 5.9354
5.657 6.1379 6.3875 5.7285 6.0271 5.8701 5.7518 5.5297 5.9275 5.8168 6.0218 5.7141 5.7858 5.2734 5.1043 5.7719 6.1106 5.4786 5.7649 5.8087
5.5939 4.88 6.1127 5.1014 5.0287 5.4296 5.9826 6.0805 5.816 5.95 5.5037 6.0471 5.3983 5.8127 5.8639 5.4055 5.7776 6.4469 5.5847 5.9362
6.0166 5.3819 5.5075 5.6116 6.1283 5.5771 6.0221 5.9787 5.9914 5.7378 6.136 6.2947 5.5923 6.1525 5.4893 5.0933 5.5726 5.1963 5.9892 6.3131
5.5738 6.0115 6.1356 5.8364 6.2263 6.1083 6.147 5.6123 5.9585 5.561 5.9231 6.1162 6.0367 5.0873 6.0336 5.9272 6.0865 5.113 5.6348 5.9155
5.8398 5.8321 5.7265 5.9536 5.8978 5.9475 6.0214 5.8874 6.0786 5.7364 5.7579 5.8132 6.0458 5.8416 5.8506 5.4326 5.6194 6.4342 5.7942 4.8988
5.6871 5.2287 5.9682 6.3543 6.0826 5.4783 6.0511 5.0799 5.8828 5.4756 5.7264 5.4572 6.1518 5.7234 5.8335 5.2863 5.6912 5.3864 5.5351 6.3403
Probability Distribution for Continuous RV
• Example (ctd). The variable values have to be binned: relative frequency histogram.
• The interval lengths and numbers of bins can be refined … 18 bins here … 40 bins here … with more data, and finer binning, the histogram outline will approach a smooth curve.
• 1000 data points. Smooth curve outline appears to be emerging.
• The smooth curve is the probability distribution associated with variable y, the height of an 18 yr old male in the US.
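The binning step can be sketched as follows. The helper `relative_frequency_histogram` is a hypothetical name, and the 1000 heights are synthetic: they are drawn from a normal distribution with mean 5.75 ft and standard deviation 0.4 ft, which is an assumption made only so the example runs on its own.

```python
import random

def relative_frequency_histogram(data, n_bins):
    """Return (bin_edges, relative_frequencies) for equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in data:
        # The maximum value goes into the last bin rather than past it.
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, [c / len(data) for c in counts]

random.seed(1)
# Synthetic heights in feet; the normal shape and its parameters are assumed.
heights = [random.gauss(5.75, 0.4) for _ in range(1000)]

for n_bins in (18, 40):
    edges, rel_freq = relative_frequency_histogram(heights, n_bins)
    print(n_bins, "bins, tallest bar:", round(max(rel_freq), 3),
          "total:", round(sum(rel_freq), 3))
```

Whatever the number of bins, the relative frequencies add up to 1; finer bins simply spread that total over more, narrower bars.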
Discrete and Continuous Probability Distributions
• Probability distributions provide a means of quantifying the probability of
obtaining a certain sample outcome.
Discrete case:
1. Probability of observing two heads when a coin is tossed two times is 0.25.
2. Probability of observing at least one head is 0.5 + 0.25 = 0.75.
3. Probability of observing either no heads or two heads is 0.25 + 0.25 = 0.5.
Note: Probabilities are equal to the fraction of the total histogram area corresponding to the values of interest.
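As a small sketch of the discrete case, the probability of an event is the sum of P(y) over the values of y that belong to the event. The dictionary `P` holds the theoretical two-coin distribution; the helper `prob` is a hypothetical name used only here.

```python
# Theoretical distribution of y = number of heads in two tosses of a fair coin.
P = {0: 0.25, 1: 0.5, 2: 0.25}

def prob(event):
    """P(event), where event is a true/false test applied to each value of y."""
    return sum(p for y, p in P.items() if event(y))

p_two_heads = prob(lambda y: y == 2)
p_at_least_one = prob(lambda y: y >= 1)
p_none_or_two = prob(lambda y: y in (0, 2))
print(p_two_heads, p_at_least_one, p_none_or_two)
```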
Continuous case:
1. Does it make sense to ask “what is the probability that an 18 y.o. male is 5’10”?” NO: a continuous variable takes any single exact value with probability 0.
2. Note: The distribution plot was created using relative frequencies, so the total area under the plot is 1.
3. We compute the probability of a value falling in a certain range of values by computing the area that lies under the distribution plot, over that range.
Example: The probability that an 18 y.o. male has a height that lies between 5.7 and 5.8 feet is approx 0.1.
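The range idea can be sketched empirically: the area of the histogram bars over a range is the fraction of the sample that falls in that range. The sample below is synthetic and assumes a normal distribution with mean 5.75 ft and standard deviation 0.4 ft; those parameters are an assumption, not something stated with the figure.

```python
import random

random.seed(2)
# Synthetic heights in feet; the normal shape, mean and spread are assumed.
sample = [random.gauss(5.75, 0.4) for _ in range(100_000)]

def fraction_in_range(data, low, high):
    """Fraction of the data with low <= value < high."""
    return sum(low <= v < high for v in data) / len(data)

p_range = fraction_in_range(sample, 5.7, 5.8)
# A single exact value (5'10" = 5 + 10/12 feet) essentially never occurs.
p_exact = sum(v == 5 + 10 / 12 for v in sample) / len(sample)
print(round(p_range, 3), p_exact)
```

Under these assumptions the range estimate should land near the 0.1 quoted above, while the single-value fraction stays at zero.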
Half-way Summary
• So far:
1. How to create probability distributions from empirical/theoretical discrete and continuous random variables.
2. How to determine probabilities of a variable attaining a certain value (discrete) or attaining a value that lies within a certain range (continuous).
3. Why is this useful? (Q: what is the probability of obtaining a particular sample)
4. Some common known distributions – binomial (discrete), normal (continuous), t-distribution (continuous), chi-squared (continuous)
5. Can make assumptions about the type of distribution associated with particular populations of interest – one of the known distributions
6. Can determine features of the underlying distributions by simulation, other empirical observations
The Binomial Distribution - Discrete
Binomial Distribution properties:
1. experiment has n identical trials
2. each trial is either a success or failure (2 possible outcomes)
3. P(success) = π for every trial, fixed
4. trials are independent
5. variable, y = # of successes in the n trials
Examples.
1. y = # heads when a coin is tossed n times (success = heads)
2. y = # light bulbs that fail inspection when n selected from a batch are
tested (success = failed inspection)
3. y = # of people who test positive for a bacterial infection out of n who
have been exposed to the bacteria (success = positive test result)
Note on independence: the outcome of one trial does not affect the outcome of any other(s).
The Binomial Distribution (ctd)
• P(y) = probability of obtaining y successes in n trials of a binomial experiment
Example (Computing P(y)). Suppose there is a 25% chance that a pregnancy
test fails. What is the probability that out of a sample of 5 tests, all 5 fail?
i.e. What is P(5)?
P(5) = P(the 1st test fails and the 2nd test fails and the 3rd test fails and … and the 5th test fails)
\[ P(5) = (0.25)(0.25)\cdots(0.25) = (0.25)^{5} = 0.000977 \]
• Now, what is P(2)? P(2) = P(1st fails and 2nd fails and rest don’t OR 1st fails and 3rd fails and rest don’t OR …)
\[ P(2) = (0.25)(0.25)(0.75)(0.75)(0.75) + (0.25)(0.75)(0.25)(0.75)(0.75) + \cdots + (0.75)(0.75)(0.75)(0.25)(0.25) \]
\[ P(2) = \binom{5}{2}(0.25)^{2}(0.75)^{3} = \frac{5!}{2!\,3!}(0.25)^{2}(0.75)^{3} = 0.2637 \]
The Binomial Distribution (ctd)
P(2) = (# ways to select 2 failing tests out of 5) * (probability of 2 tests failing) * (probability of 3 tests not failing)
= 5C2 * 0.25^2 * 0.75^3
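The counting argument can be checked by brute force. The sketch below lists every possible sequence of 5 test results, keeps those with exactly 2 failures, and adds up their probabilities; the result is then compared with 5C2 * 0.25^2 * 0.75^3. Variable names are illustrative only.

```python
from itertools import product
from math import comb

p_fail = 0.25  # probability that a single test fails (from the example)
n = 5

# Add up the probability of every sequence of 5 results with exactly 2 failures.
p2_enumerated = 0.0
n_sequences = 0
for seq in product([True, False], repeat=n):  # True = test fails
    if sum(seq) == 2:
        n_sequences += 1
        p2_enumerated += p_fail ** 2 * (1 - p_fail) ** 3

p2_formula = comb(n, 2) * p_fail ** 2 * (1 - p_fail) ** 3
print(n_sequences, round(p2_enumerated, 4), round(p2_formula, 4))
```

Each qualifying sequence has the same probability because the trials are independent, which is why counting the sequences is enough.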
The Binomial Distribution (ctd)
Probability of y successes in n trials of a binomial experiment:
\[ P(y) = \frac{n!}{y!\,(n-y)!}\,\pi^{y}(1-\pi)^{n-y} \]
y = # successes in n trials
n = # trials
π = probability of success on a single trial
Mean and Standard Deviation of the Binomial Distribution:
Mean: \( \mu = n\pi \)
Standard Deviation: \( \sigma = \sqrt{n\pi(1-\pi)} \)
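A minimal sketch of the binomial formula and of its mean and standard deviation, using only the Python standard library. The function names `binomial_pmf` and `binomial_mean_sd` are hypothetical; the numbers reuse the 5-test example (n = 5, π = 0.25).

```python
from math import comb, sqrt

def binomial_pmf(y, n, pi):
    """P(y) = n! / (y! (n - y)!) * pi**y * (1 - pi)**(n - y)."""
    return comb(n, y) * pi ** y * (1 - pi) ** (n - y)

def binomial_mean_sd(n, pi):
    """Return (mean, standard deviation) = (n*pi, sqrt(n*pi*(1 - pi)))."""
    return n * pi, sqrt(n * pi * (1 - pi))

n, pi = 5, 0.25  # the 5-test example
pmf = [binomial_pmf(y, n, pi) for y in range(n + 1)]
mean, sd = binomial_mean_sd(n, pi)
print([round(p, 4) for p in pmf], round(sum(pmf), 6), mean, round(sd, 3))
```

The probabilities P(0), …, P(n) sum to 1, which is a quick sanity check on any implementation of the formula.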
The Binomial Distribution (ctd)
• Example. What is the probability that 6 out of 20 tests fail, if the
probability that any one test fails is 25%? Success = test fails
So, π = 0.25, n = 20, y = 6
\[ P(6) = \frac{20!}{6!\,14!}(0.25)^{6}(0.75)^{14} = \frac{20\cdot19\cdot18\cdot17\cdot16\cdot15}{6\cdot5\cdot4\cdot3\cdot2\cdot1}(0.25)^{6}(0.75)^{14} = 0.1686 \]
Note: P(y ≥ 7) = P(7) + P(8) + P(9) + … + P(20)
= 1 – P(y ≤ 6)
• What are the mean and standard deviation of this distribution?
\[ \mu = n\pi = 20(0.25) = 5 \]
\[ \sigma = \sqrt{n\pi(1-\pi)} = \sqrt{20(0.25)(0.75)} = 1.94 \]
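The 20-test example, including the complement rule in the note, can be sketched as follows. `binomial_pmf` is a hypothetical helper defined here so the block stands alone.

```python
from math import comb, sqrt

def binomial_pmf(y, n, pi):
    """Probability of exactly y successes in n independent trials."""
    return comb(n, y) * pi ** y * (1 - pi) ** (n - y)

n, pi = 20, 0.25
p6 = binomial_pmf(6, n, pi)

# P(y >= 7) computed two ways: summing the upper tail, and as 1 - P(y <= 6).
p_at_most_6 = sum(binomial_pmf(y, n, pi) for y in range(0, 7))
p_at_least_7 = sum(binomial_pmf(y, n, pi) for y in range(7, n + 1))

mean = n * pi
sd = sqrt(n * pi * (1 - pi))
print(round(p6, 4), round(p_at_least_7, 4), round(1 - p_at_most_6, 4),
      mean, round(sd, 2))
```

The complement form needs only 7 terms instead of 14, which is why it is the usual route for upper-tail probabilities.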
The Normal Distribution - Continuous
• Bell-shaped curve, symmetric about mean
• Numerous continuous random variables have a normal distribution – eg. test scores, weight, 100m sprint times
• Normal curve is defined by μ and σ
• Empirical rule holds: approx 68% of the population lies within ± 1σ of μ
• P(y1 ≤ y < y2) = area under normal curve between y=y1 and y=y2
Normal curve, f(y):
\[ f(y) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(y-\mu)^{2}/(2\sigma^{2})} \]
(Here π is the constant 3.14159…, not the binomial probability of success.)
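A short sketch of the normal curve and of the empirical rule. The functions `normal_pdf` and `normal_cdf` are hypothetical helper names; the area to the left of y is written with the error function `math.erf`, a standard identity for the normal distribution. The values μ = 5.75 and σ = 0.4 are illustrative.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(y, mu, sigma):
    """Height of the normal curve f(y) with mean mu and standard deviation sigma."""
    return exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def normal_cdf(y, mu, sigma):
    """Area under the normal curve to the left of y."""
    return 0.5 * (1 + erf((y - mu) / (sigma * sqrt(2))))

mu, sigma = 5.75, 0.4  # illustrative values
within_one_sd = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)
print(round(normal_pdf(mu, mu, sigma), 4), round(within_one_sd, 4))
```

The area within one standard deviation does not depend on the particular μ and σ chosen, which is what the empirical rule expresses.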
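The Normal Distribution
• Computing probabilities for normally distributed populations:
\[ f(y) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(y-\mu)^{2}/(2\sigma^{2})} \]
\[ P(y_1 \le y < y_2) = \int_{y_1}^{y_2} f(y)\,dy = \int_{y_1}^{y_2} \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(y-\mu)^{2}/(2\sigma^{2})}\,dy \]
Example: P(5.5 ≤ y < 5.7) = 0.1844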
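The integral has no closed form, but the area can be approximated numerically. The sketch below uses the trapezoidal rule, which is one simple choice and not necessarily how the slide's value was obtained. The parameters μ = 5.75 and σ = 0.4 are an assumption borrowed from the percentile example later in these notes; they are not stated next to the 0.1844 figure.

```python
from math import exp, pi, sqrt

def normal_pdf(y, mu, sigma):
    """Height of the normal curve at y."""
    return exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def area_under_curve(f, y1, y2, n_strips=1000):
    """Approximate the integral of f from y1 to y2 with the trapezoidal rule."""
    h = (y2 - y1) / n_strips
    total = 0.5 * (f(y1) + f(y2))
    for i in range(1, n_strips):
        total += f(y1 + i * h)
    return total * h

mu, sigma = 5.75, 0.4  # assumed parameters
p = area_under_curve(lambda y: normal_pdf(y, mu, sigma), 5.5, 5.7)
print(round(p, 4))
```

Under these assumed parameters the result agrees with the quoted probability to about three decimal places.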
The Normal Distribution – Standard Normal
Computing probabilities (ctd):
- Normal curves vary by variable values (x-axis), depend on μ and σ, but are identical in shape
- Standard normal distribution: μ = 0 and σ = 1
- Tables exist for areas under this graph (Table 1, Appendix of text)
- In a standard normal distribution, the values on the x-axis are known as z-values
- Values between z = 0.5 and z = 1.1 are measurements that lie between 0.5 and 1.1 standard deviations away from the mean of 0.
The Normal Distribution – Reading from the table
• Table 1 contains areas under the standard normal curve that lie to the left of a particular z-value.
• i.e. Reading the entry corresponding to z1 we obtain P(z < z1)
So
P(0.5 ≤ z < 1.1) = P(z < 1.1) - P(z < 0.5)
= 0.8643 - 0.6915
= 0.1728
[Figure: standard normal curve with z-values on the horizontal axis, showing the areas P(z < 0.5), P(z < 1.1) and P(0.5 ≤ z < 1.1)]
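The table lookup can be mirrored in code with `statistics.NormalDist` from the Python standard library (Python 3.8 or later), whose `cdf` method returns the area to the left of a value. This is a sketch of the same subtraction, not a replacement for Table 1.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)  # standard normal distribution

p_below_05 = standard.cdf(0.5)   # area to the left of z = 0.5
p_below_11 = standard.cdf(1.1)   # area to the left of z = 1.1
p_between = p_below_11 - p_below_05
print(round(p_below_05, 4), round(p_below_11, 4), round(p_between, 4))
```

Because the table entries are rounded to four decimals before subtracting, the last digit of the difference may differ slightly from the unrounded computation.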
The Normal Distribution – Z-scores
• We can use Table 1 for arbitrary normal distributions, as long as μ and σ are
known.
• This is done by standardizing the measurement values, y, to standard
normal values known as z-scores:
\[ z = \frac{y-\mu}{\sigma} \]
Example. Consider a normal distribution with μ = 25 and σ = 3.5. Compute
the probability that the value of a measurement lies between 27 and 30.
\[ P(27 \le y < 30) = P\!\left(\frac{27-25}{3.5} \le z < \frac{30-25}{3.5}\right) = P(z < 1.4286) - P(z < 0.5714) \]
\[ = 0.9236 - 0.7157 = 0.2079 \]
Here y1 = 27, y2 = 30, z1 = 0.5714 and z2 = 1.4286.
There is a 20.79% probability that y takes a value between 27 and 30.
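The standardizing step can be sketched with `statistics.NormalDist`. The code converts y1 = 27 and y2 = 30 to z-scores, looks up the areas on the standard normal, and cross-checks the answer against the same probability computed directly on the original scale.

```python
from statistics import NormalDist

mu, sigma = 25, 3.5
y1, y2 = 27, 30

# Standardize: z = (y - mu) / sigma
z1 = (y1 - mu) / sigma
z2 = (y2 - mu) / sigma

standard = NormalDist()  # mean 0, standard deviation 1
p_via_z = standard.cdf(z2) - standard.cdf(z1)

# Same probability computed directly on the original scale.
original = NormalDist(mu, sigma)
p_direct = original.cdf(y2) - original.cdf(y1)
print(round(z1, 4), round(z2, 4), round(p_via_z, 4), round(p_direct, 4))
```

The computed probability can differ from the table-based 0.2079 in the third or fourth decimal place, since a table lookup rounds the z-scores to two decimals first.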
The Normal Distribution – Percentiles
• Def: The 100pth percentile of a distribution is the value yp such that
100p% of the population values lie below yp and 100(1-p)% lie above yp.
• To find percentiles of standard normal distribution –
– reverse lookup of Table 1
Example. Find the 33rd percentile of the standard normal distribution.
Need to find zp such that 100p% of values lie below zp, with p = 0.33. I.e. find zp such that
P(z ≤ zp) = 33%
From Table 1: zp = -0.44
So, the 33rd percentile is -0.44
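A reverse lookup can be imitated by rebuilding a left-tail table and searching it for the area closest to the target. The grid of z-values from -3.99 to 3.99 in steps of 0.01 is an assumption about how such a table is laid out, and `reverse_lookup` is a hypothetical helper name.

```python
from statistics import NormalDist

standard = NormalDist()

# Rebuild a left-tail table: area to the left of z, for z = -3.99, ..., 3.99.
table = {i / 100: standard.cdf(i / 100) for i in range(-399, 400)}

def reverse_lookup(p):
    """Return the tabulated z whose left-tail area is closest to p."""
    return min(table, key=lambda z: abs(table[z] - p))

z_33 = reverse_lookup(0.33)
print(z_33, round(table[z_33], 4))
```

`NormalDist().inv_cdf(0.33)` gives the same percentile without the two-decimal rounding that a table imposes.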
• To apply this idea to general normal distributions, we do a reverse
standardizing:
• The 100pth percentile is yp such that 100p% of measurements lie below yp.
• I.e. P(y ≤ yp) = 100p%. We can find the z-score zp associated
with 100p%, and convert it back to y-values using:
\[ y_p = \mu + z_p\,\sigma \]
• Example. For the normal distribution with μ = 5.75 and σ = 0.4, find the
40th percentile.
• From Table 1, zp = -0.25
• yp = 5.75 + (-0.25)*0.4 = 5.65
• The 40th percentile of this distribution
is 5.65.
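The reverse standardizing can be sketched with `statistics.NormalDist.inv_cdf`, which returns the value below which a given fraction of the distribution lies. The code finds zp on the standard normal, converts it with yp = μ + zp·σ, and cross-checks against the percentile computed directly on the y scale.

```python
from statistics import NormalDist

mu, sigma = 5.75, 0.4
p = 0.40

z_p = NormalDist().inv_cdf(p)   # 40th percentile of the standard normal
y_p = mu + z_p * sigma          # convert back to the y scale

# Cross-check: percentile computed directly on the y scale.
y_p_direct = NormalDist(mu, sigma).inv_cdf(p)
print(round(z_p, 2), round(y_p, 2), round(y_p_direct, 2))
```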