
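Linear Regression Analysis (3) & Understanding Survey Data

Quantitative Analysis Methods in Health Policy and Management

2015. 9. 15.

Department of Health Policy and Management, Seoul National University College of Medicine

Do Young Kyung (도영경)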

Linear regression: a few remaining topics

• t-test, F-test, ANOVA

• Statistical significance
• Policy/practice/clinical significance

• Interpreting dummy variables
• Logged dependent and independent variables

• Moderation (효과조절)
• Mediation (매개)

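Community Health Survey (지역사회건강조사)

Moderation (효과조절)

[Diagram with the variables: living with a household member with dementia; income; self-rated health; VAS]

Mediation (매개)

[Diagram with the variables: income; private health insurance; self-rated health; VAS]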

Understanding survey data

• Weights, Clustering, Strata
– Conceptual understanding
– Consequences when ignored (Option 1, 2, 3)
– Exercise

• Missing data

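Terminology: random vs. randomized

• Random (sampling)

– Probability sampling (as opposed to non-probability sampling)

– Concerns how the sample is drawn from the population

• Randomized (controlled trial)

– Always involves two or more groups (treatment, control)

– Concerns how study subjects are assigned to the different groups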
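
The distinction above can be sketched in a few lines. This is a minimal Python illustration under assumed numbers (a hypothetical population of 1,000 ids, a sample of 100, two arms of 50); none of these figures come from the lecture.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
population = list(range(1000))  # hypothetical population of 1,000 ids

# Random sampling: decides who gets INTO the study (population -> sample).
sample = rng.sample(population, 100)

# Randomization: decides which GROUP each study subject is assigned to.
shuffled = sample[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:50], shuffled[50:]

print(len(sample), len(treatment), len(control))
```

The first step bears on whether results generalize to the population; the second bears on whether the two groups are comparable.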

Three options

            Use weights?   Account for design features?

Option 1    No             No

Option 2    Yes            No

Option 3    Yes            Yes

Weights

• Conceptual understanding
– If a group is oversampled, its sampling probability is higher, and its expansion weights are lower, than those of other groups.

• Consequences when ignored
– Biased estimates

• Serious in descriptive statistics

• Could be okay in a regression if the related variable is controlled for in the regression model

– Option 1 gives wrong point estimates compared with Options 2 & 3.
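
The effect of ignoring weights on a descriptive statistic can be checked with a small numerical sketch. The Python example below assumes a made-up population of 1,000 people in two groups, with group B oversampled; every number in it is hypothetical and chosen only for illustration.

```python
# Hypothetical population: 900 people in group A (outcome 10 each),
# 100 people in group B (outcome 20 each). True mean = 11.0.
population = {"A": {"size": 900, "value": 10.0},
              "B": {"size": 100, "value": 20.0}}

# Sample 90 from A (probability 0.1) and 50 from B (probability 0.5):
# group B is oversampled.
sample_sizes = {"A": 90, "B": 50}

rows = []  # one (value, expansion_weight) pair per sampled person
for group, n in sample_sizes.items():
    prob = n / population[group]["size"]   # sampling probability
    weight = 1.0 / prob                    # expansion weight
    rows += [(population[group]["value"], weight)] * n

true_mean = sum(g["size"] * g["value"] for g in population.values()) / 1000
unweighted_mean = sum(v for v, _ in rows) / len(rows)
weighted_mean = sum(v * w for v, w in rows) / sum(w for _, w in rows)

print(true_mean, unweighted_mean, weighted_mean)
```

Group B has the higher sampling probability and therefore the lower expansion weight (2 versus 10). The unweighted mean is pulled toward group B's value, while the weighted mean matches the population mean of 11.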

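Stratification

• Conceptual understanding: Why is it done?

– To improve precision, so that the sample is more likely to be representative

– Used when the population is expected to be highly heterogeneous
– e.g., province/metropolitan city (시도), town/township (읍/면) areas, apartment vs. house

• Consequences if ignored
– Usual SEs are too large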

Clustering

• Conceptual understanding: Why is it done?

– Practical issues (mainly costs)
– e.g., households, classes, schools, public health centers

• Consequences if ignored
– Usual SEs underestimate true SEs (statistical significance is overstated)

Stratification and clustered sampling

• In general, the effect on SEs is more pronounced for clustering, because stratification is usually mild.

• Although the two are done for different reasons, both end up generating unique primary sampling units (PSUs).

• Option 2 gives wrong SEs compared with Option 3 (but the parameter estimates are exactly the same in Options 2 & 3).
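
One common way to gauge how much clustering inflates SEs is Kish's approximate design effect for a mean, deff = 1 + (m - 1)ρ, where m is the cluster size and ρ the intraclass correlation. This formula is not part of the slides; it is added here only as an illustration, and the cluster size and correlation below are hypothetical.

```python
import math

def design_effect(cluster_size, icc):
    """Kish's approximate design effect for equal-sized clusters."""
    return 1.0 + (cluster_size - 1) * icc

def corrected_se(naive_se, cluster_size, icc):
    """Inflate an SE that was computed as if observations were independent."""
    return naive_se * math.sqrt(design_effect(cluster_size, icc))

# Hypothetical design: 20 respondents per cluster, intraclass correlation 0.05.
deff = design_effect(20, 0.05)      # about 1.95
se = corrected_se(1.0, 20, 0.05)    # about 1.40 times the naive SE
print(deff, se)
```

Even a small within-cluster correlation nearly doubles the variance here, which is why an SE that ignores clustering is too small. The point estimate itself is unaffected.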

Putting things together: Exercise

                   Descriptive statistics      Multiple regression model

           Var     Mean or freq.    SE         Coeff.    SE

Option 1   N-W     0.3              0.03       3.9       0.029
           INC     30K              150        -0.9      0.1
           …

Option 2   N-W     0.2              0.01       ( )       0.3
           INC     40K              100        ( )       0.2
           …

Option 3   N-W     ( )              0.02       4.0       0.6
           INC     ( )              200        -1.0      0.8
           …

1. Is the non-white (N-W) group oversampled in this sample? How can we tell?

2. Fill in the blanks.

3. When complex survey design features are ignored, are we more likely or less likely to reject H0?

4. Give one possible explanation for why the coefficient estimates in the multiple regression do not differ across Options as markedly as the mean estimates in the descriptive statistics.

Missingness

• Missing Completely at Random (MCAR)

• Non-ignorable

• Complete Case Analysis (CCA)

• Dummy Variable Analysis (DVA)
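
The two handling strategies named above can be sketched as data-preparation steps. The Python example below assumes that DVA refers to the missing-indicator approach (fill the missing value with a constant and add a dummy flagging it); the records, variable names, and fill value are all hypothetical.

```python
# Hypothetical records; None marks a missing income value.
records = [
    {"id": 1, "income": 30.0, "health": 4},
    {"id": 2, "income": None, "health": 3},
    {"id": 3, "income": 45.0, "health": 5},
    {"id": 4, "income": None, "health": 2},
]

def complete_cases(rows, var):
    """CCA: keep only rows where `var` is observed."""
    return [r for r in rows if r[var] is not None]

def dummy_variable_adjustment(rows, var, fill=0.0):
    """DVA: keep all rows, fill missing `var` with a constant,
    and add an indicator flagging which values were missing."""
    out = []
    for r in rows:
        missing = r[var] is None
        new = dict(r)
        new[var] = fill if missing else r[var]
        new[var + "_missing"] = 1 if missing else 0
        out.append(new)
    return out

cca = complete_cases(records, "income")
dva = dummy_variable_adjustment(records, "income")
print(len(cca), len(dva))
```

CCA drops two of the four records, which costs sample size even when the data are MCAR. The indicator approach keeps all four and passes both `income` and `income_missing` to the regression.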

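Analyzing complex survey data in Stata

• Identify the following elements in the raw data

– Stratification variable: e.g., region (province/metropolitan city, city/county/district)

– Cluster variable (cluster, PSU): the primary sampling unit, e.g., household, school

– Weight

• . svyset cluster_var [pw=weight_var], strata(strata_var)

• . svy: command var1 var2 …

• Example: a total of 2,000 elementary schools are selected across the nation's cities/counties/districts (시군구), and their students are surveyed.

– Stratification variable: city/county/district

– Cluster variable: the 2,000 elementary schools

– Weight: the extent to which each student (respondent) represents elementary school students nationwide

– svyset school_id [pweight=weight_var], strata(district)

• Example: the 16 provinces/metropolitan cities are stratified into large cities, small and medium-sized cities, and rural areas; Population and Housing Census enumeration districts are the primary sampling units, and households and household members within them are sampled at the second stage.

– svyset psu_var [pweight=weight_var], strata(urban_type) || ssu_var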

Estimation commands currently available with svy:

The following estimation commands support the svy prefix.

Command       Description

Descriptive statistics
mean          Estimate means
proportion    Estimate proportions
ratio         Estimate ratios
total         Estimate totals

Linear regression models
cnsreg        Constrained linear regression
etregress     Linear regression with endogenous treatment effects
glm           Generalized linear models
intreg        Interval regression
nl            Nonlinear least-squares estimation
regress       Linear regression
tobit         Tobit regression
truncreg      Truncated regression

Structural equation models
sem           Structural equation model estimation command

Survival-data regression models
stcox         Cox proportional hazards model
streg         Parametric survival models

Binary-response regression models
biprobit      Bivariate probit regression
cloglog       Complementary log-log regression
hetprobit     Heteroskedastic probit regression
logistic      Logistic regression, reporting odds ratios
logit         Logistic regression, reporting coefficients
probit        Probit regression
scobit        Skewed logistic regression

Discrete-response regression models
clogit        Conditional (fixed-effects) logistic regression
mlogit        Multinomial (polytomous) logistic regression
mprobit       Multinomial probit regression
ologit        Ordered logistic regression
oprobit       Ordered probit regression
slogit        Stereotype logistic regression

Poisson regression models
gnbreg        Generalized negative binomial regression
nbreg         Negative binomial regression
poisson       Poisson regression
tnbreg        Truncated negative binomial regression
tpoisson      Truncated Poisson regression
zinb          Zero-inflated negative binomial regression
zip           Zero-inflated Poisson regression

Instrumental-variables regression models
ivprobit      Probit model with endogenous regressors
ivregress     Single-equation instrumental-variables regression
ivtobit       Tobit model with continuous endogenous regressors

Regression models with selection
heckman       Heckman selection model
heckoprobit   Ordered probit model with sample selection
heckprobit    Probit model with sample selection

Using Stata

. svyset PSU [pweight = WT_ex], strata(Kstrata)

. svy : mean HE_BMI


. svy : tab HE_OBE, missing

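* Unlike in SAS, the svyset command needs to be run only once; it does not have to be repeated for each analysis.

The SAS program provided with the Korea National Health and Nutrition Examination Survey (국민건강영양조사) data, converted to Stata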

Using Stata


. svy : regress HE_BMI sex


. svy : logit HE_BMI25 sex


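[pweight=weight variable]

• In Stata, the weight for survey data is “pweight”

• pweights, or sampling weights, are weights that denote the inverse of the probability that the observation is included because of the sampling design.

– That is, the inverse of the probability that a given sample observation is selected from its cluster or stratum. Multiplying by the weight therefore makes it possible to estimate group totals and the number of units in each stratum. Weights in survey data refer to pweights.

– For example, if an observation's weight is 200, that observation represents 200 units in the population (sampling probability 1/200).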
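
The arithmetic behind these statements can be sketched outside Stata. The Python example below assumes a hypothetical five-person sample with two strata and made-up pweights; it sums the weights to estimate each stratum's population count, and sums weight times indicator to estimate a total.

```python
from collections import defaultdict

# Hypothetical sample: (stratum, has_condition, pweight).
# A pweight of 200 means the respondent stands for 200 people
# (sampling probability 1/200).
sample = [
    ("urban", 1, 200.0),
    ("urban", 0, 200.0),
    ("urban", 0, 200.0),
    ("rural", 1, 50.0),
    ("rural", 1, 50.0),
]

pop_size = defaultdict(float)   # estimated population count per stratum
cases = defaultdict(float)      # estimated number with the condition

for stratum, has_condition, weight in sample:
    pop_size[stratum] += weight
    cases[stratum] += weight * has_condition

print(dict(pop_size))
print(dict(cases))
```

Here the three urban respondents stand for 600 people and the two rural respondents for 100, so the estimated number with the condition is 200 in the urban stratum and 100 in the rural stratum.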

http://www.stata.com/manuals13/u11.pdf#u11.1.6weight
http://www.stata.com/manuals13/u20.pdf#u20.23Weightedestimation
2016-02-24

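Weights in panel data

• Cross-sectional weights: weights defined for each individual wave (survey round).
– For estimating population parameters (mean, variance, proportion) for each year

– For examining the trend of the same variable across years

• Longitudinal weights: only cases that responded in every wave, from the first survey year through the year for which the weights are constructed, are extracted and treated as the final respondents, and weights are constructed for these final respondents.
– For analyses of how much a group with a given characteristic changed from an earlier year to a later year
• e.g., to analyze students who had characteristic A in year 2 and came to have characteristic B in year 3, use the year-3 longitudinal weights.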
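
The difference between the two respondent sets can be sketched as a filtering step. The Python example below uses a hypothetical response history for four respondents over three waves; the ids and function names are made up, and the construction of the weights themselves is not shown.

```python
# Hypothetical response history: respondent id -> set of waves answered.
responses = {
    "s01": {1, 2, 3},
    "s02": {1, 3},
    "s03": {1, 2},
    "s04": {1, 2, 3},
}

def longitudinal_sample(history, up_to_wave):
    """Respondents who answered every wave from 1 through `up_to_wave`."""
    required = set(range(1, up_to_wave + 1))
    return sorted(rid for rid, waves in history.items() if required <= waves)

def cross_sectional_sample(history, wave):
    """Respondents who answered the given wave, whatever else they did."""
    return sorted(rid for rid, waves in history.items() if wave in waves)

print(longitudinal_sample(responses, 3))
print(cross_sectional_sample(responses, 3))
```

Respondent s02 skipped wave 2, so s02 counts toward the wave-3 cross-sectional set but not toward the wave-3 longitudinal set.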

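Source: Korean Education Longitudinal Study 2005 (VI): a study on longitudinal weights and nonresponse imputation methods (한국교육종단연구2005(VI) 종단적 가중치 및 무응답 대체법 연구). Korean Educational Development Institute report.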

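References

• Use of Secondary Data Sources in Korean Health Care (국내보건의료 이차자료원 활용), NECA, 2013

• Korean Education Longitudinal Study 2005 (VI): a study on longitudinal weights and nonresponse imputation methods (한국교육종단연구2005(VI) 종단적 가중치 및 무응답 대체법 연구). Korean Educational Development Institute report.

• http://www.stata.com/manuals13/svy.pdf

• Son Chang-gyun, Hong Ki-hak, Lee Ki-sung, “Sampling and Management Manual” (표본추출 및 관리 매뉴얼), Korea Institute for Health and Social Affairs policy report.

• http://www.ats.ucla.edu/stat/stata/seminars/applied_svy_stata13/

• http://yhs.cdc.go.kr/upload/board/20121217105123227.pdf

• http://healthstat.snu.ac.kr/hokim/seminar/sampling20070118.pdf

• http://www.cpc.unc.edu/research/tools/data_analysis/sas_to_stata/sas_to_stata.html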
