Metlit-02 Populasi, Sampel & Variabel - Prof. dr. Sudigdo S, SpA(K).ppt

[email protected]/2010

Population & Sample

Sudigdo Sastroasmoro


Population is a large group of study subjects (humans, animals, tissues, blood specimens, medical records, etc.) with defined characteristics [“Population is a group of study subjects defined by the researcher as the population”]

A sample is a subset of the population that is directly investigated. The sample should be (or be assumed to be) representative of the population; otherwise all statistical analyses will be invalid

Investigations are always performed on the sample, and the results are then applied to the population


Avoid using ambiguous terms

Sample population
Sampled population
Populasi sampel (Indonesian for “sample population”)
Study population ~ sample


Gap between Das Sein & Das Sollen

Literature study

Research question(s) / Hypothesis

Methods / Design

Data collection & analyses

Conclusions

In the real world (“Population”)

In the sample

Infer


The sample is assumed to be representative of the population. In research, measurements are always done in the sample, and the results are applied to the population.

[Figure: sampling draws a sample (S) from the population (P); the investigation is carried out in S; the results are then applied back to P by inference.]


Target population → Accessible population → Intended sample → Actual study subjects


Target population = domain = population to which the results of the study will be applied. In clinical research it is usually characterized by demographic & clinical characteristics; e.g. normal infants, teens with epilepsy, post-menopausal women with osteoporosis.

Accessible population = subset of the target population that can be accessed by the investigator. Frame: time & place. Example: teens with epilepsy in RSCM, 2000-2005; women with osteoporosis in RSGS, 2002

Intended sample = subjects who meet the eligibility criteria and are selected to be included in the study

Actual study subjects = subjects who actually completed their participation in the study


Target population (demographic, clinical)
→ usually based on practical purposes →
Accessible population (+ time, place)
→ appropriate sampling technique →
Intended sample [subjects selected for study]
→ [non-response, drop-outs, withdrawals, loss to follow-up] →
Actual study subjects [subjects who completed the study]


Target population (domain)
[External validity II: does AP represent TP?]
Accessible population
[External validity I: does IS represent AP?]
Intended sample
[Internal validity: does ASS represent IS?]
Actual study subjects


Internal validity: how well the study was done (usually measurement, but also including whether the actual study subjects represent the intended sample). Many drop-outs? Loss to follow-up? Low compliance?

External validity I: whether the intended sample represents the accessible population (random sampling? convenience sampling?)

External validity II: whether the accessible population represents the target population. This cannot be calculated, but can be judged by common sense & general knowledge

Validity: Internal & external


A. Probability sampling
Simple random sampling (random number table, computer generated)
Stratified random sampling
Systematic sampling
Cluster sampling
Others: two-stage cluster sampling, etc.

B. Non-probability sampling
Consecutive sampling
Convenience sampling
Judgmental sampling / Purposive sampling

Sampling methods


Predicting the 1936 Election

In 1936, Literary Digest mailed questionnaires to 10 million people, asking who they would vote for in the upcoming presidential election. The list was compiled from magazine subscribers, car owners and telephone directories. Based on the 2.3 million responses, they predicted a victory for Republican Landon over Roosevelt by a 60 to 40 margin.

Roosevelt won with 61% of the vote, to 36% for Landon.

George Gallup correctly predicted the election (and the results of the Literary Digest poll!) to within 1 percent, using random samples.


Probability sampling (1)

Simple random sampling: select 50 out of 900 students

1. Using a random number table:
   Example: 146*72 2*238*9 12*970 *127*63 8*759*0 29*874 *390*48 6*8301

2. Using computer-generated random numbers (pseudo-random)
   Command: How many subjects do you have? 900
   How many do you want to select? 50
   Enter → 017, 068, 113, 142, etc.

Repeating the procedure exactly will result in completely different numbers (unless the same seed is used)
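The computer-generated draw described above can be sketched in a few lines of Python. This is only an illustration: the function name `simple_random_sample` and the seed value are assumptions, not part of the original slide.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct subject numbers from 1..population_size."""
    # seed is optional; fixing it makes the draw repeatable (assumption for the demo)
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# Select 50 out of 900 students, as in the slide
selected = simple_random_sample(900, 50, seed=2010)
print(selected[:5])
```

Without a seed, each run gives a different set of numbers; with the same seed, the same 50 subjects are selected again, which helps when the selection has to be documented.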


Simple Random Sample: n = 20, N= 2000


Probability sampling (2)

Systematic sampling: every m-th subject is selected
Selected (starting) number: k

Example: k = 3, m = 10: 3, 13, 23, 33, 43, etc.

Better (more representative) than SRS if there are no natural trends or strata


Systematic sample: N = 2000, n = 20, m = 100, k = 45

45, 145, 245, ..., 1945
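As a rough sketch, the systematic sample above (N = 2000, m = 100, k = 45) can be generated as follows. The function name `systematic_sample` is hypothetical; in practice the starting number k would itself be chosen at random between 1 and m.

```python
def systematic_sample(population_size, interval, start):
    """Select every interval-th subject, beginning with subject number start."""
    return list(range(start, population_size + 1, interval))


# N = 2000, m = 100, k = 45, as on the slide
systematic = systematic_sample(2000, 100, 45)
print(systematic[:3], "...", systematic[-1])  # [45, 145, 245] ... 1945
```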



Stratified [random] sampling: random sampling is done in each stratum separately, e.g., by sex, age group, stage of disease, etc.
The results are then combined


Stratified sample of 20 from 4 strata
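A minimal sketch of a stratified sample of 20 from 4 strata is given below. The stratum sizes (500 subjects each), the equal allocation of 5 subjects per stratum, and all names are assumptions made for illustration; the slide does not specify them.

```python
import random


def stratified_sample(strata, per_stratum, seed=None):
    """Draw a simple random sample of per_stratum subjects within each stratum."""
    rng = random.Random(seed)
    return {name: sorted(rng.sample(members, per_stratum))
            for name, members in strata.items()}


# Hypothetical population of 2000 subjects split into 4 strata of 500
strata = {f"stratum_{i + 1}": list(range(i * 500 + 1, (i + 1) * 500 + 1))
          for i in range(4)}

chosen = stratified_sample(strata, per_stratum=5, seed=1)
total_selected = sum(len(members) for members in chosen.values())
print(total_selected)  # 20
```

Equal allocation is only one option; allocating in proportion to stratum size is another common choice.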



Cluster sampling

Subjects are selected separately according to cluster or place (RT, RW [Indonesian neighborhood units], district, etc.)


Cluster Sample of 20 (cluster size = 4)
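One way to picture a cluster sample of 20 with cluster size 4 is to draw 5 whole clusters at random and include every subject in them. The sketch below assumes a hypothetical population of 500 clusters of 4 subjects; the names and numbers are illustrative, not taken from the slide.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick n_clusters whole clusters; all their members are included."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), n_clusters)
    return {cluster_id: clusters[cluster_id] for cluster_id in picked}


# Hypothetical population: 500 clusters, each with 4 subjects (2000 in total)
clusters = {c: [c * 4 + j + 1 for j in range(4)] for c in range(500)}

sampled_clusters = cluster_sample(clusters, n_clusters=5, seed=1)
subjects = [s for members in sampled_clusters.values() for s in members]
print(len(subjects))  # 20
```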


Non-probability sampling (1)

Consecutive sampling:

Subjects are selected according to their appearance on the list
Most commonly used in clinical studies

Can be expected to resemble random sampling if the time span is long enough

This is the best of the non-probability sampling methods


Non-probability sampling (2)

Convenience sampling
Judgmental sampling

They are rarely justified except for certain conditions, e.g. normal values


All statistical analyses (inferences) are based on (simple) random sampling

Whether or not a sample is representative of the population depends on whether or not it resembles the result that random sampling would have produced

Note


How to generalize results in the sample to the population:

Introduction to statistical inference


IMPORTANT!!! Statistical significance vs. clinical importance

A negligible clinical difference may be statistically very significant if the number of subjects is very large, e.g., a difference in cholesterol reduction of 3 mg/dl with n1 = n2 = 10,000; p = 0.00002

A large clinical difference may be statistically non-significant if the number of subjects is very small, e.g., a 30% difference in cure rate with n1 = n2 = 10; p = 0.74


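Clinical importance vs. statistical significance

[Figure: randomization (R) to standard treatment (n = 10000) and new treatment (n = 10000). Mean cholesterol level at baseline: x̄ = 300 mg/dl in both groups; after treatment: x̄ = 200 and x̄ = 197 mg/dl. t-test: df = 9998, p = 0.00002. The difference is judged from two viewpoints: clinical and statistical.]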


Clinical significance vs. statistical significance

Group         Cured   Died
Standard Rx   0       10 (100%)
New Rx        3       7 (70%)

Clinical: absolute risk reduction = 30%
Statistical: Fisher exact test, p = 0.211
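The p value for the small 2 × 2 table above can be reproduced with a hand-written two-sided Fisher exact test. This is a sketch under the usual definition (sum of the probabilities of all tables with the same margins that are no more likely than the observed one); the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small tolerance so that ties with the observed table are counted
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Standard Rx: 0 cured, 10 died; New Rx: 3 cured, 7 died
p_fisher = fisher_exact_two_sided(0, 10, 3, 7)
print(round(p_fisher, 3))  # 0.211
```

A 30% absolute risk reduction is clinically large, yet with only 10 subjects per group the test cannot distinguish it from chance.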


Abstract
• Objectives:
• Methods:
• Results: After 2 months of treatment, there was a significant difference in LDL (P = 0.0032) and HDL (P = 0.048), but no significant difference in triglyceride (P = 0.073) between the 2 groups.
• Conclusion:


Can the results of the study (in the sample) be applied to the accessible or target population?
Hypothesis testing & confidence interval

Introduction to statistical inference


Statistic and Parameter

An observed value drawn from the sample is called a statistic (cf. statistics, the science)
The corresponding value in the population is called a parameter
We measure and analyze statistics, and interpret them as estimates of parameters


Examples of statistics:

Proportion
Percentage
Mean
Median
Mode
Difference in proportion/mean
OR
RR
Sensitivity
Specificity
Kappa
LR
NNT


There are 2 ways of inferring a parameter from a statistic:

Hypothesis testing → p value
Estimation → confidence interval (CI)

The p value & CI express the same concept in different ways


P value

The probability of obtaining the observed results (or more extreme ones) by chance alone if the null hypothesis (H0) were true


Group   Success    Failure    Total
C       30 (60%)   20 (40%)   50
E       40 (80%)   10 (20%)   50

Chi-square test: df = 1; p = 0.0432



If drugs E and C were equally effective, we could still get the above result (a 20% difference in success rate), but the probability is small (4.32%)

If drugs E and C were equally effective, the probability of obtaining a difference this large by chance alone is 4.32%

If we define in advance that p < 0.05 is significant, then the result is called statistically significant


A similar interpretation applies to ALL hypothesis testing (t-test, ANOVA, non-parametric tests, Pearson correlation, multivariate tests, etc.):

If the null hypothesis were true, the probability of obtaining the result would be ……. (for example 0.02 or 2%, etc.)


Confidence Intervals

Estimate the range of values (parameter) in the population using a statistic in the sample (as point estimate)


[Figure: sample (S) and population (P). If the observed result in the sample is X (a statistic, the point estimate), what is the figure in the population? The CI gives a range of values around X.]


Most commonly used CI:

90% CI corresponds to p 0.10
95% CI corresponds to p 0.05
99% CI corresponds to p 0.01

Note:
p value: only for analytical studies
CI: for descriptive and analytical studies


How to calculate CI

General Formula:

CI = p ± Z × SE

• p = point estimate, a value drawn from the sample (a statistic)

• Z = standard normal deviate for α; if α = 0.05, Z = 1.96 (~ 95% CI)

• SE = standard error of the point estimate


Example 1

100 FKUI students, 60 females (p = 0.6)
What is the proportion of females among Indonesian FK (medical school) students? (assuming FKUI represents FK in Indonesia)


Example 1

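\[
SE(p) = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.6 \times 0.4}{100}} \approx \frac{0.5}{10} = 0.05, \qquad q = 1 - p
\]

\[
95\%\ CI = p \pm 1.96 \times SE(p) = 0.6 \pm 1.96 \times 0.05 = 0.6 \pm 0.10 = 0.50 \,;\, 0.70
\]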
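The calculation in Example 1 can be checked with a short script. This is a sketch using the same normal-approximation formula, CI = p ± Z × SE with SE = √(pq/n); the function name `proportion_ci` is an assumption.

```python
from math import sqrt


def proportion_ci(successes, n, z=1.96):
    """Normal-approximation CI for a proportion: p +/- z * sqrt(p * q / n)."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


# 60 females among 100 students
prop_low, prop_high = proportion_ci(60, 100)
print(f"95% CI: {prop_low:.2f} to {prop_high:.2f}")  # 95% CI: 0.50 to 0.70
```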


Example 2: CI of the mean

• 100 newborn babies, mean BW = 3000 (SD = 400) grams, what is 95% CI?

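\[
SEM = \frac{SD}{\sqrt{n}} = \frac{400}{\sqrt{100}} = 40
\]

\[
95\%\ CI = \bar{x} \pm 1.96 \times SEM = 3000 \pm 1.96 \times \frac{400}{\sqrt{100}} \approx 3000 \pm 80 = (3000 - 80) \,;\, (3000 + 80) = 2920 \,;\, 3080
\]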
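The same steps for the CI of a mean can be sketched in Python. The function name `mean_ci` is hypothetical; the script keeps the unrounded margin 1.96 × 40 = 78.4 grams, which the slide rounds to 80.

```python
from math import sqrt


def mean_ci(mean, sd, n, z=1.96):
    """CI for a mean: mean +/- z * SEM, where SEM = SD / sqrt(n)."""
    sem = sd / sqrt(n)
    return mean - z * sem, mean + z * sem


# 100 newborn babies, mean birth weight 3000 g, SD 400 g
bw_low, bw_high = mean_ci(3000, 400, 100)
print(f"95% CI: {bw_low:.1f} to {bw_high:.1f} grams")  # 95% CI: 2921.6 to 3078.4 grams
```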


Example 3: CI of difference between proportions (p1-p2)

• 50 patients with drug A, 30 cured (p1 = 0.6)
• 50 patients with drug B, 40 cured (p2 = 0.8)

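\[
95\%\ CI(p_1 - p_2) = (p_1 - p_2) \pm 1.96 \times SE(p_1 - p_2)
\]

\[
SE(p_1 - p_2) = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} = \sqrt{\frac{(0.6 \times 0.4)}{50} + \frac{(0.8 \times 0.2)}{50}} = \sqrt{\frac{0.4}{50}} \approx 0.09
\]

Taking the size of the difference as 0.8 - 0.6 = 0.2 (drug B minus drug A):

\[
95\%\ CI = 0.2 \pm 1.96 \times 0.09 \approx 0.2 \pm 0.18 = 0.02 \,;\, 0.38
\]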
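Example 3 can be verified with the sketch below, which applies (p1 − p2) ± 1.96 × SE(p1 − p2) to the cure counts given above. The function name is an assumption, and the difference is taken as drug B minus drug A so that it is positive.

```python
from math import sqrt


def diff_proportion_ci(x1, n1, x2, n2, z=1.96):
    """CI for a difference between two independent proportions (first minus second)."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Drug B: 40 of 50 cured; drug A: 30 of 50 cured
dp_low, dp_high = diff_proportion_ci(40, 50, 30, 50)
print(f"95% CI: {dp_low:.3f} to {dp_high:.3f}")
```

The interval runs from roughly 0.02 to 0.38; it does not include 0.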


Example 4: CI for difference between 2 means

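Mean systolic BP:
50 smokers = 146.4 (SD 18.5) mmHg
50 non-smokers = 140.4 (SD 16.8) mmHg

\[
\bar{x}_1 - \bar{x}_2 = 6.0 \text{ mmHg}
\]

\[
95\%\ CI(\bar{x}_1 - \bar{x}_2) = (\bar{x}_1 - \bar{x}_2) \pm 1.96 \times SE(\bar{x}_1 - \bar{x}_2)
\]

\[
SE(\bar{x}_1 - \bar{x}_2) = s \times \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \quad \text{where } s \text{ is the pooled standard deviation}
\]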


Example 4: CI for difference between 2 means

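\[
s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(49 \times 18.5^2) + (49 \times 16.8^2)}{98}} \approx 17.7
\]

\[
SE(\bar{x}_1 - \bar{x}_2) = 17.7 \times \sqrt{\frac{1}{50} + \frac{1}{50}} \approx 3.53
\]

\[
95\%\ CI = 6.0 \pm (1.96 \times 3.53) \approx 6.0 \pm 6.9 = -0.9 \,;\, 12.9 \quad (\text{about } -1.0 \text{ to } 13.0)
\]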
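The pooled-SD calculation of Example 4 can be sketched as follows, using the means and SDs stated for the two groups (146.4 and 18.5 mmHg for smokers, 140.4 and 16.8 mmHg for non-smokers). The function name `diff_mean_ci` is hypothetical, and Z = 1.96 is used as on the slide rather than a t value.

```python
from math import sqrt


def diff_mean_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """CI for a difference between two means using the pooled standard deviation."""
    pooled_sd = sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    se = pooled_sd * sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    return pooled_sd, se, (diff - z * se, diff + z * se)


pooled_sd, se_diff, (dm_low, dm_high) = diff_mean_ci(146.4, 18.5, 50, 140.4, 16.8, 50)
print(f"s = {pooled_sd:.1f}, SE = {se_diff:.2f}")          # s = 17.7, SE = 3.53
print(f"95% CI: {dm_low:.1f} to {dm_high:.1f} mmHg")       # 95% CI: -0.9 to 12.9 mmHg
```

The interval includes 0, so a 6.0 mmHg difference in samples of this size is compatible with no true difference.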


Other commonly supplied CI

Relative risk (RR)
Odds ratio (OR)
Sensitivity, specificity (Se, Sp)
Likelihood ratio (LR)
Relative risk reduction (RRR)
Number needed to treat (NNT)


Altman & Gore

• Statistics with confidence


Suggested CI presentation:

• 95% CI: 1.5 to 4.5
• 95% CI: -2.5 to 4.3
• 95% CI: -12 to -6

• Not recommended: 3 ± 1.5
• Not recommended: -9 ± 3


In contrast to the CI for a proportion, a mean, or a difference between proportions/means, where the CI is symmetrical around the point estimate, CIs for RR, OR, LR, and NNT are asymmetrical because the calculations involve logarithms


Examples

RR = 5.6 (95% CI 1.2 ; 23.7)
OR = 12.8 (95% CI 3.6 ; 44.2)
NNT = 12 (95% CI 9 ; 26)
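To illustrate why such intervals are asymmetrical, the sketch below computes a CI for a ratio of two proportions on the log scale. The data are the cure counts from Example 3 (drug B 40/50, drug A 30/50), not the figures quoted above, and the standard-error formula for ln(RR) is the commonly used log method, which is an assumption here since the slides do not state which method they use.

```python
from math import exp, log, sqrt


def risk_ratio_ci(x1, n1, x2, n2, z=1.96):
    """Ratio of two proportions with a CI computed on the log scale."""
    rr = (x1 / n1) / (x2 / n2)
    # Standard error of ln(RR) for two independent proportions (log method)
    se_log = sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)
    return rr, exp(log(rr) - z * se_log), exp(log(rr) + z * se_log)


# Cure rate with drug B (40/50) relative to drug A (30/50)
rr, rr_low, rr_high = risk_ratio_ci(40, 50, 30, 50)
print(f"RR = {rr:.2f} (95% CI {rr_low:.2f} ; {rr_high:.2f})")
```

The limits are symmetrical around ln(RR) but not around RR itself: after back-transformation the upper limit lies farther from the point estimate than the lower limit.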


If p value < 0.05, then the 95% CI:
excludes 0 (for a difference), because if A = B then A - B = 0 and p > 0.05
excludes 1 (for a ratio), because if A = B then A/B = 1 and p > 0.05

For small number of subjects, computer calculated CI may not meet this rule due to correction for continuity automatically done by the computer


Concluding remarks

In every study, the sample should be (or be assumed to be) representative of the population. Otherwise all statistical calculations are invalid

The p value (hypothesis testing) gives the probability that the result in the sample is merely caused by chance; it does not give the magnitude and direction of the difference

The confidence interval (estimation) gives an estimate of the value in the population given one result in the sample; it gives the magnitude and direction of the difference


Concluding remarks

A p value alone tends to equate statistical significance with clinical importance

CI avoids this confusion because it provides an estimate of clinical values and also shows whether the result is statistically significant

Whenever applicable, supply the CI, especially for the main results of the study

In critical appraisal of study results, the focus should be on the CI rather than on the p value.
