antonio-holmes
Chapter 4. Elements of Statistics
# A brief introduction to some concepts of statistics
# Descriptive statistics and inductive statistics (statistical inference)
# Classification of the field of statistics
i) Sampling theory
ii) Estimation theory
iii) Hypothesis testing
iv) Curve fitting or regression
v) Analysis of variance
4.2 Sampling Theory – The Sample Mean
How many samples are required for a given degree of confidence in the result?
# Terminology
- population: N (size of the population), very large or ∞
- (random) sample: n (size of the sample)
# One of the most important quantities is the sample mean.
How close might the sample mean be to the average value of the population?
Let the sample have the numerical values x1, x2, …, xn.
Then the sample mean is given by
$$\hat{\bar{x}} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Note that we are interested in the statistical properties of arbitrary random samples rather than any particular sample.
That is, the sample mean becomes a random variable.
Therefore, it is appropriate to denote the sample mean as
$$\hat{\bar{X}} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
We want the mean value of the sample mean to be close to the true mean value of the population. In fact,
$$E[\hat{\bar{X}}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\,n\bar{X} = \bar{X}$$
that is, the mean value of the sample mean = the true mean value of the population.
The sample mean is an unbiased estimate of the true mean.
But this is not sufficient to indicate whether the sample mean is a good estimator of the true population mean.
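What is the variance of the sample mean?
Assume N ≫ n (the characteristics of the population do not change during sampling).
Variance = mean square − square of the mean:
$$\mathrm{Var}(\hat{\bar{X}}) = E\left[\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} X_i X_j\right] - (\bar{X})^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} E[X_i X_j] - (\bar{X})^2$$
Assumption: X_i and X_j are statistically independent for i ≠ j, so
$$E[X_i X_j] = \begin{cases} \overline{X^2}, & i = j \\ (\bar{X})^2, & i \neq j \end{cases}$$
Therefore
$$\mathrm{Var}(\hat{\bar{X}}) = \frac{1}{n^2}\left[n\,\overline{X^2} + (n^2 - n)(\bar{X})^2\right] - (\bar{X})^2 = \frac{\overline{X^2} - (\bar{X})^2}{n} = \frac{\sigma^2}{n}$$
where σ² is the true variance of the population. As n → ∞, the variance → 0,
which means that large sample sizes lead to a better estimate.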
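As a rough numerical check of the σ²/n result, the sketch below enumerates every ordered sample of size n drawn with replacement from a small population, so that the draws are independent as the derivation assumes. The population values and the name `sample_mean_variance` are illustrative assumptions, not taken from the textbook.

```python
from itertools import product
from statistics import fmean, pvariance


def sample_mean_variance(population, n):
    """Exact variance of the sample mean over all equally likely ordered
    samples of size n drawn with replacement from `population`."""
    means = [fmean(sample) for sample in product(population, repeat=n)]
    return pvariance(means)


# Hypothetical population, chosen only for illustration.
population = [2.0, 4.0, 4.0, 7.0, 9.0]
sigma2 = pvariance(population)  # true population variance

for n in (1, 2, 4):
    # The two printed numbers should agree up to rounding error.
    print(n, sample_mean_variance(population, n), sigma2 / n)
```

Enumeration is used instead of random simulation so that the comparison is exact; it is only practical for very small populations and sample sizes.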
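* Notes:
1) When N is not large, the same effect as a large N can be obtained by "sampling with replacement".
2) When N is small and the samples cannot be replaced,
$$\mathrm{Var}(\hat{\bar{X}}) = \frac{\sigma^2}{n}\left(\frac{N - n}{N - 1}\right)$$
As N → ∞ this converges to the previous expression; when N = n it is 0 (as expected).
Two examples: see pp. 163–165 of the textbook.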
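The finite-population formula can be checked in the same spirit. The sketch below, assuming a small made-up population, enumerates every sample of size n drawn without replacement and compares the exact variance of the sample mean with (σ²/n)(N − n)/(N − 1). The function names are hypothetical.

```python
from itertools import combinations
from statistics import fmean, pvariance


def exact_variance_without_replacement(population, n):
    """Variance of the sample mean over all equally likely samples of
    size n drawn without replacement."""
    means = [fmean(sample) for sample in combinations(population, n)]
    return pvariance(means)


def corrected_variance(population, n):
    """(sigma^2 / n) * (N - n) / (N - 1), with sigma^2 the population variance."""
    N = len(population)
    return pvariance(population) / n * (N - n) / (N - 1)


# Hypothetical population, chosen only for illustration.
population = [2.0, 4.0, 4.0, 7.0, 9.0]

for n in range(1, len(population) + 1):
    print(n, exact_variance_without_replacement(population, n),
          corrected_variance(population, n))
```

For n = N there is only one possible sample, so both columns are zero, matching the remark above.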
4.3 Sampling Theory – The Sample Variance
The population variance is needed for determining the sample size required to achieve a desired variance of the sample mean (see eq. 4-4).
Definition (Sample Variance):
$$S^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \hat{\bar{X}}\right)^2$$
The expected value of the sample variance can be derived easily using
$$\hat{\bar{X}} = \frac{1}{n}\sum_{j=1}^{n} X_j$$
which gives
$$E[S^2] = \frac{n-1}{n}\,\sigma^2$$
This is not the true variance σ²; that is, S² is a biased estimate rather than an unbiased one.
Now we redefine the sample variance to obtain an unbiased estimate of the population variance:
$$\tilde{S}^2 = \frac{n}{n-1}\,S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \hat{\bar{X}}\right)^2$$
Note that these hold for very large N, that is, N = ∞.
What about when the population size is not large?
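To see the bias concretely, the following sketch averages both variance estimates over every ordered sample drawn with replacement from a small made-up population. It relies on the standard-library functions `statistics.pvariance` (divisor n) and `statistics.variance` (divisor n − 1); the population and the function name are assumptions made for illustration.

```python
from itertools import product
from statistics import fmean, pvariance, variance


def expected_variance_estimates(population, n):
    """Return (E[S^2], E[S~^2]) over all equally likely ordered samples
    of size n (n >= 2) drawn with replacement."""
    samples = list(product(population, repeat=n))
    e_biased = fmean([pvariance(s) for s in samples])   # divisor n
    e_unbiased = fmean([variance(s) for s in samples])  # divisor n - 1
    return e_biased, e_unbiased


# Hypothetical population, chosen only for illustration.
population = [2.0, 4.0, 4.0, 7.0, 9.0]
sigma2 = pvariance(population)
n = 3
e_biased, e_unbiased = expected_variance_estimates(population, n)
print(e_biased, (n - 1) / n * sigma2)  # biased estimate vs (n-1)/n * sigma^2
print(e_unbiased, sigma2)              # unbiased estimate vs sigma^2
```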
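# When N is not large, the expected value of S² is given by
$$E[S^2] = \frac{N}{N-1}\cdot\frac{n-1}{n}\,\sigma^2$$
To obtain an unbiased estimate, we redefine
$$\tilde{S}^2 = \frac{N-1}{N}\cdot\frac{n}{n-1}\,S^2$$
# The variance of the estimates of the variance:
the variance of S²:
$$\mathrm{Var}(S^2) = \frac{\mu_4 - \sigma^4}{n}$$
the variance of S̃²:
$$\mathrm{Var}(\tilde{S}^2) = \frac{n(\mu_4 - \sigma^4)}{(n-1)^2}$$
where μ₄ is the 4th central moment of the population:
$$\mu_4 = E[(X - \bar{X})^4]$$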
4.4 Sampling Distributions & Confidence Intervals
What is the probability that the estimates are within specified bounds?
The p.d.f. must be known. Two cases are considered, and only for the sample mean.
Normalized sample mean:
$$Z = \frac{\hat{\bar{X}} - \bar{X}}{\sigma/\sqrt{n}}$$
When the X_i are Gaussian and independent => Z is Gaussian (0,1).
Even if the X_i are not Gaussian, Z is asymptotically Gaussian as n => ∞ by the central limit theorem (as a rule of thumb, n should usually be at least 30).
H.W.) Solve the following problems in Chap. 4: 4-2.1, 4-2.5, 4-3.1, 4-4.1, 4-5.1, 4-6.1
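When σ is unknown, it is replaced by S̃. However, the normalized variable is then no longer Gaussian => "Student's t distribution" with n − 1 degrees of freedom (see Figure 4-2 on p. 170):
$$T = \frac{\hat{\bar{X}} - \bar{X}}{\tilde{S}/\sqrt{n}} = \frac{\hat{\bar{X}} - \bar{X}}{S/\sqrt{n-1}}$$
p.d.f. of Student's t distribution:
$$f_T(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi\nu}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \qquad \nu = n - 1$$
where Γ(·) is the gamma function:
$$\Gamma(k+1) = k\,\Gamma(k) \text{ for any } k, \qquad \Gamma(k+1) = k! \text{ for integer } k$$
The t distribution has heavier tails than the Gaussian; for n ≥ 30 it is close to the Gaussian.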
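The Student's t density can be evaluated directly with `math.gamma` from the Python standard library. The sketch below is only an illustration: the function names and the use of Simpson's rule for the cumulative probability are assumptions made for this example, not the textbook's procedure.

```python
import math


def student_t_pdf(t, nu):
    """Density of Student's t distribution with nu degrees of freedom."""
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(math.pi * nu) * math.gamma(nu / 2))
    return coeff * (1 + t * t / nu) ** (-(nu + 1) / 2)


def student_t_cdf(t, nu, steps=2000):
    """P(T <= t), using symmetry about 0 and Simpson's rule on [0, |t|].
    `steps` must be even."""
    b = abs(t)
    h = b / steps
    total = student_t_pdf(0.0, nu) + student_t_pdf(b, nu)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * student_t_pdf(i * h, nu)
    area = total * h / 3  # integral of the density from 0 to |t|
    return 0.5 + area if t >= 0 else 0.5 - area


# Tail probability below -2.896 for nu = 8 (should be close to 0.01).
print(student_t_cdf(-2.896, 8))
```

Integrating only over [0, |t|] and adding 0.5 avoids truncating the heavy tails.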
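Useful values of the gamma function: Γ(1) = Γ(2) = 1, Γ(1/2) = √π.
What is a confidence interval?
An interval estimate: it asks with what probability the quantity lies within an interval. A q-percent confidence interval holds with probability q/100 (the confidence level):
$$\bar{X} - \frac{k\sigma}{\sqrt{n}} \le \hat{\bar{X}} \le \bar{X} + \frac{k\sigma}{\sqrt{n}}$$
• Here k is a constant that depends on q and on the p.d.f. of the sample mean.
• For specific values of k, see Table 4-1 on p. 172.
• (The larger q is, the larger k becomes.)
$$q = 100\int_{\bar{X} - k\sigma/\sqrt{n}}^{\bar{X} + k\sigma/\sqrt{n}} f_{\hat{\bar{X}}}(x)\,dx$$
• Example) q = 95% ->
$$9.804 \le \hat{\bar{x}} \le 10.196$$
• The probability that the sample mean lies in this interval is 0.95.
• The smaller the interval, the smaller the probability.
• (For q = 99% the interval becomes wider, but the estimate becomes less informative!)
• Note: q can also be obtained from the probability distribution function:
$$q = 100\left[F_{\hat{\bar{X}}}\left(\bar{X} + \frac{k\sigma}{\sqrt{n}}\right) - F_{\hat{\bar{X}}}\left(\bar{X} - \frac{k\sigma}{\sqrt{n}}\right)\right]$$
• Here F is the probability distribution function of Student's t distribution.
• (See Appendix F or Table 4-2 on page 172 for ν = 8.)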
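For the Gaussian case, the constant k and the interval bounds can be computed with `statistics.NormalDist` from the Python standard library. In the sketch below the function name is hypothetical, and the inputs (center 10, σ = 1, n = 100) are assumptions chosen because they reproduce the 95% interval quoted above; the slides do not state them.

```python
from math import sqrt
from statistics import NormalDist


def gaussian_confidence_interval(center, sigma, n, q):
    """q-percent interval center ± k*sigma/sqrt(n) for a Gaussian sample mean."""
    # k leaves (100 - q)/2 percent in each tail of the standard normal.
    k = NormalDist().inv_cdf(0.5 + q / 200)
    half_width = k * sigma / sqrt(n)
    return center - half_width, center + half_width


# Assumed inputs (not given in the slides): center 10, sigma 1, n 100.
lower, upper = gaussian_confidence_interval(10.0, 1.0, 100, 95)
print(round(lower, 3), round(upper, 3))

# A higher confidence level gives a wider interval for the same data.
lower99, upper99 = gaussian_confidence_interval(10.0, 1.0, 100, 99)
print(round(lower99, 3), round(upper99, 3))
```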
4.5 Hypothesis Testing
• The question arises: how does one decide to accept or reject a given hypothesis when the sample size and the confidence level are specified?
• Two steps: i) make some hypothesis about the population;
• ii) determine whether the observed sample confirms or rejects this hypothesis.
• Two tests: one-sided or two-sided.
- One-sided example: the average lifetime of a light bulb >= 1000 hours
- Two-sided example: 100-ohm resistors, too high or too low
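One-sided test
Example) A capacitor manufacturer claims that the mean value of breakdown voltage is >= 300 V.
• A sample of 100 capacitors gives (sample mean, s̃²) = (290 V, 40² V²).
• A 99% confidence level is used.
• Q) Is the manufacturer's claim valid?
• A) We would reject the hypothesis!
With the hypothesized mean X̄ = 300 V and σ taken as s̃ = 40 V, the normalized random variable Z has the value
$$z = \frac{\hat{\bar{x}} - \bar{X}}{\sigma/\sqrt{n}} = \frac{290 - 300}{40/\sqrt{100}} = -2.5$$
The 99% confidence level corresponds to the critical value z_c given by
$$F_Z(z_c) = \int_{-\infty}^{z_c} f_Z(z)\,dz = 1 - 0.99 \quad\Rightarrow\quad z_c = -2.33$$
Since z = −2.5 < z_c = −2.33, the hypothesis is rejected.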
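• If a 99.5% confidence level were used: accept the hypothesis, since
$$z_c = -2.575 < z = -2.5$$
• The lower the confidence level, the narrower the acceptance interval and the less likely the hypothesis is to be accepted.
• That is, a lower confidence level imposes a more severe requirement, which seems contradictory in meaning.
• So let us restate this in terms of the level of significance,
• that is, (100% − confidence level).
• The larger the level of significance, the more severe the test!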
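The one-sided capacitor test can be reproduced with a few lines of Python. This is a minimal sketch, assuming the Gaussian approximation used above; the function name and its return convention are hypothetical, and the critical value comes from `statistics.NormalDist` rather than from the textbook table.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_lower_test(sample_mean, claimed_mean, s, n, confidence):
    """Test the claim 'true mean >= claimed_mean'.
    Returns (z, z_c, accepted); the claim is accepted when z >= z_c."""
    z = (sample_mean - claimed_mean) / (s / sqrt(n))
    z_c = NormalDist().inv_cdf(1 - confidence)  # lower-tail critical value
    return z, z_c, z >= z_c


# Numbers from the capacitor example: 290 V observed, 300 V claimed,
# s = 40 V, n = 100.
for confidence in (0.99, 0.995):
    z, z_c, accepted = one_sided_lower_test(290, 300, 40, 100, confidence)
    print(confidence, round(z, 3), round(z_c, 3),
          "accept" if accepted else "reject")
```

Equivalently, the level of significance is `1 - confidence`, which is the tail probability passed to `inv_cdf`.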
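• Example, continued) sample size = 9, with (sample mean, s̃²) = (290 V, 40² V²)
• No longer Gaussian -> Student's t distribution
• ν = n − 1 = 8 degrees of freedom
• Confidence level 99%:
$$t = \frac{\hat{\bar{x}} - \bar{X}}{\tilde{s}/\sqrt{n}} = \frac{290 - 300}{40/\sqrt{9}} = -0.75$$
$$t_c = -2.896 < t = -0.75$$
– accept the hypothesis
• A small sample size increases t (here from −2.5 to −0.75), and
• the heavier tails of the t distribution decrease t_c,
so the statistic is more likely to exceed the critical value. Small-sample tests are less reliable (less severe) than large-sample tests.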
Two-sided test
• Example) A manufacturer of Zener diodes claims that the true mean breakdown voltage = 10 V.
• Q) Hypothesis: the true mean is 10 V. Accept or reject?
• 100 samples -> (sample mean, s̃²) = (10.3 V, 1.2² V²)
• 95% confidence level
• A) Rejected!
$$z = \frac{\hat{\bar{x}} - \bar{X}}{\sigma/\sqrt{n}} = \frac{10.3 - 10}{1.2/\sqrt{100}} = 2.5$$
• z is outside the interval
$$-1.96 \le z \le 1.96$$
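• Q) Continued: 9 samples, again with (sample mean, s̃²) = (10.3 V, 1.2² V²)
$$t = \frac{\hat{\bar{x}} - \bar{X}}{\tilde{s}/\sqrt{n}} = \frac{10.3 - 10}{1.2/\sqrt{9}} = 0.75$$
• t is inside the interval
$$-2.306 \le t \le 2.306$$
(95% of the probability lies between −t_c and t_c = 2.306, with 2.5% in each tail)
• Accepted!
– Less severe than a large-sample test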
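The two Zener-diode tests differ only in the sample size and in which critical value applies. The sketch below recomputes both statistics; the helper names are hypothetical, the Gaussian critical value comes from `statistics.NormalDist`, and the t critical value 2.306 (ν = 8) is simply the table value quoted above, since the standard library has no t distribution.

```python
from math import sqrt
from statistics import NormalDist


def normalized_statistic(sample_mean, hypothesized_mean, s, n):
    """(sample mean - hypothesized mean) / (s / sqrt(n))."""
    return (sample_mean - hypothesized_mean) / (s / sqrt(n))


def two_sided_accept(statistic, critical):
    """Accept the hypothesis when the statistic lies in [-critical, critical]."""
    return -critical <= statistic <= critical


# Large sample (n = 100): Gaussian critical value at the 95% level.
z = normalized_statistic(10.3, 10.0, 1.2, 100)
z_c = NormalDist().inv_cdf(0.975)
print(round(z, 3), round(z_c, 3), two_sided_accept(z, z_c))

# Small sample (n = 9): Student's t with 8 degrees of freedom.
T_C_DOF8 = 2.306  # table value quoted in the slides, not computed here
t = normalized_statistic(10.3, 10.0, 1.2, 9)
print(round(t, 3), T_C_DOF8, two_sided_accept(t, T_C_DOF8))
```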
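4.6 Curve Fitting and Linear Regression
• An analysis method that uses data to find, statistically, the functional relationship between variables (independent and dependent); that is, it examines how x and y are related by finding a suitable regression equation.
• The equation is usually first-order (linear) or second-order.
• In contrast, the correlation analysis of the next section examines how x and y are related by computing a correlation coefficient.
• Terminology
– Scatter diagram: a plot of the data
- n samples: $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$
- Curve fitting: finding a mathematical relationship; the resulting curve is the regression curve (equation)
- What is the "best" fit? The best fit in a least-squares sense.
– Let $\varepsilon_i$ be the errors between the regression curve and the scatter diagram.
– The problem is to determine the unknown coefficients that minimize $\varepsilon_1^2 + \varepsilon_2^2 + \cdots + \varepsilon_n^2$.
– First choose the type of equation to be fitted to the data, e.g. $y = a + bx + cx^2$; keeping the number of unknown coefficients much smaller than n gives a smoothing effect.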
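• Linear regression
$$y = a + bx$$
• Which a and b minimize the following?
$$J = \sum_{i=1}^{n}\left[y_i - (a + b x_i)\right]^2$$
• Solution)
$$\frac{\partial J}{\partial a} = 0: \quad \sum_{i=1}^{n} y_i = an + b\sum_{i=1}^{n} x_i$$
$$\frac{\partial J}{\partial b} = 0: \quad \sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2$$
• Solving the simultaneous equations gives
$$b = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$
$$a = \frac{\sum_{i=1}^{n} y_i - b\sum_{i=1}^{n} x_i}{n}$$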
MATLAB built-in function: p = polyfit(x, y, n)
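The closed-form least-squares solution for a and b translates almost line by line into code. The sketch below is a plain-Python illustration, not the MATLAB routine; the function name and the sample data are made up for the example.

```python
def linear_regression(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Slope and intercept from the two normal equations.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Hypothetical data lying close to the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
a, b = linear_regression(xs, ys)
print(a, b)
```

The denominator is zero when all x values are equal, in which case no unique line exists.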
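• A second-order regression (textbook p. 180, Table 4-3, Figure 4-6)
Fitted second-order curve for $v_B(T)$: coefficient magnitudes 0.0334 ($T^2$ term), 0.6540 ($T$ term), and 426.0500 (constant term).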
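4.7 Correlation between Two Sets of Data
• Are two data sets correlated or not?
$$x_1, x_2, \ldots, x_n, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
$$y_1, y_2, \ldots, y_n, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$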
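• Linear correlation coefficient, "Pearson's r":
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}, \qquad |r| \le 1$$
r is also random; for large n (on the order of 500), r is approximately Gaussian.
Usage: useful in determining the sources of errors.
Example) A point-to-point digital communication link
The quality of the link is judged by its BER (Bit Error Rate). The BER may fluctuate randomly due to wind.
Q) Is wind the error source? A correlation test between 20 wind-speed measurements and the resulting BER gives r = 0.891, which is large enough, so yes!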
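Pearson's r follows directly from its definition. The sketch below is a minimal illustration; the function name is hypothetical and the data are made-up numbers, not the wind and BER measurements from the textbook example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Linear correlation coefficient of two equal-length data sets."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (sqrt(ss_x) * sqrt(ss_y))


# Hypothetical data: y tends to grow with x, so r should be close to +1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(pearson_r(xs, ys))
```

A value near +1 or −1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship; r is undefined when either data set is constant.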