
Page 1: Basics of statistics

Basics of statistics

Conducted by Dept. of Biostatistics, NIMHANS From 28 to 30 Sept, 2015

"It is easy to lie with statistics, But it is hard to tell the truth without statistics."

–Andrejs Dunkels

Page 2: Basics of statistics

Topics covered
• Introduction
• Types of statistics
• Definitions
• Variable & types
• Variable scales
• Description of data
• Distribution of sample & population
• Measures of center, dispersion & shape
• Properties of the Normal distribution
• Testing of hypothesis
• Types of error
• Estimation of sample size
• Various tests to be used
• Central limit theorem
• Parametric tests: t-test, ANOVA, Post hoc, Correlation & Regression
• Non-parametric tests
• Tests for categorical data
• Summary of tests to be used
• Qualitative vs Quantitative research
• Qualitative research
• Software packages

Page 3: Basics of statistics

Statistics
• Consists of a body of methods for collecting and analyzing data.
• It provides methods for-
– Design- planning and carrying out research studies
– Description- summarizing and exploring data
– Inference- making predictions & generalizing about phenomena represented by the data

Page 4: Basics of statistics

Types of statistics
• There are 2 major types of statistics.
• Descriptive statistics- consists of methods for organizing and summarizing information.
– Includes graphs, charts, tables & the calculation of averages and percentiles.
• Inferential statistics- consists of methods for drawing conclusions about a population, and measuring their reliability, based on information obtained from a sample.
– Includes point estimation, interval estimation and hypothesis testing.
• Both are interrelated: the methods of descriptive statistics are needed to organize and summarize the information obtained before the methods of inferential statistics can be used.

Page 5: Basics of statistics

Population & Sample
• Basic concepts in statistics.
• Population- the collection of all individuals or items under consideration in a statistical study.
• Sample- the part of the population from which information is actually collected.
• The population is always the target of the investigation; we learn about the population by sampling from it.

Page 6: Basics of statistics

• Parameters- summarize the features of the population under investigation.
• Statistic- describes a characteristic of the sample, which can then be used to make inferences about the unknown parameters.

Page 7: Basics of statistics

Variable & types
• Variable- a characteristic that varies from one person or thing to another.
• Types- Qualitative/Quantitative, Discrete/Continuous, Dependent/Independent.
• Qualitative data- variables that yield non-numerical data.
– Eg- sex, marital status, eye colour
• Quantitative data- variables that yield numerical data.
– Eg- height, weight, number of siblings

Page 8: Basics of statistics

• Discrete variable- takes only a countable number of distinct possible values.
– Eg- number of car accidents, number of children
• Continuous variable- can take any value within a range; its unit is infinitely divisible.
– Eg- weight, length, temperature
• Independent variable- a variable that does not depend on other variables in the study.
– Eg- age, sex
• Dependent variable- a variable that depends on the independent variable.
– Eg- weight of a newborn, stress

Page 9: Basics of statistics

Variable scales
• Variables can also be described according to the scale on which they are defined.
• Nominal scale- the categories are merely names; they have no natural order.
– Eg- male/female, yes/no
• Ordinal scale- the categories can be put in order, but the differences between adjacent categories are not necessarily equal.
– Eg- mild/moderate/severe

Page 10: Basics of statistics

• Interval scale- differences between values are comparable, but the scale has no absolute zero.
– Eg- temperature, time
• Ratio scale- differences between values are comparable and the scale has an absolute zero.
– Eg- stress measured using the PSS, insomnia measured using the ISI
• Nominal & Ordinal scales are used to describe Qualitative data.
• Interval & Ratio scales are used to describe Quantitative data.

Page 11: Basics of statistics

Describing data
• Qualitative data-
– Frequency- the number of observations falling into a particular class/category of the qualitative variable.
– Frequency distribution- a table listing all classes & their frequencies.
– Graphical representation- Pie chart, Bar graph.
– Nominal data are best displayed by a pie chart.
– Ordinal data are best displayed by a bar graph.

Page 12: Basics of statistics

• Quantitative data-
– Can also be presented by a frequency distribution.
– If a discrete variable has many different values, or if the data are continuous, the data can be grouped into classes/categories.
– Class interval- covers the range between the maximum & minimum values.
– Class limits- the end points of a class interval.
– Class frequency- the number of observations in the data that belong to each class interval.
– Usually presented as a Histogram or a Bar graph.

Page 13: Basics of statistics

Population & Sample distribution
• Population distribution- frequency distribution of the population.
• Sample distribution- frequency distribution of the sample.
• The sample distribution is a blurry photo of the population distribution: as the sample size increases, the sample distribution becomes a closer representation of the population distribution.
• A sample or population distribution can be summarized by describing its shape (based on the graph).
• It can be Symmetric, or Nonsymmetric/Skewed to the left or right, based on its tail.

Page 14: Basics of statistics

Properties of Numerical data & Measures
• Central tendency- Mean, Median, Mode
• Dispersion- Range, Interquartile range, Standard deviation
• Shape- Skewness, Kurtosis

Page 15: Basics of statistics

Measures of center
• Central tendency- in any distribution, the majority of the observations pile up, or cluster, around a particular region.
– Includes Mean, Median & Mode.
• Mean- the sum of the observed values divided by the number of observations.
• Median- the observation that divides the ordered data set into halves.
• Mode- the value that occurs with the greatest frequency.
• Mean & Median can be applied only to Quantitative data.
• Mode can be used with either Qualitative or Quantitative data.

Page 16: Basics of statistics

What to choose?
• Qualitative variable- Mode.
• Quantitative with symmetric distribution- Mean.
• Quantitative with skewed distribution- Median.
• Outlier- an observation that falls far from the rest of the data. The mean is highly influenced by outliers.
• We use the sample mean, median & mode to estimate the population mean, median & mode.

Page 17: Basics of statistics

Measures of dispersion
• Dispersion- the spread/variability of values about the measures of central tendency. These measures quantify the variability of the distribution.
• Measures include-
– Range
– Sample interquartile range
– Standard deviation
• Mostly used for quantitative data.
• Range- the difference between the largest and the smallest observed value in the data set.
– Considering only the range ignores a great deal of information.

Page 18: Basics of statistics

• Interquartile range- the difference between the first & third quartiles of the variable.
– Percentiles- divide the observed values into 100 equal parts.
– Deciles- divide the observed values into 10 equal parts.
– Quartiles- divide the observed values into 4 equal parts; Q1 divides the bottom 25% of observed values from the top 75%...
• Standard deviation- roughly, a typical deviation of the observed values from the mean of the variable (formally, the square root of the average squared deviation).
– It is defined using the sample mean, and its value is strongly affected by a few extreme observations.
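These measures map directly onto standard NumPy/SciPy functions. A minimal sketch, assuming a small made-up data set (not from the slides) and SciPy ≥ 1.9:

```python
import numpy as np
from scipy import stats

# illustrative data, not from the slides; 45 is a deliberate outlier
data = np.array([4, 7, 7, 8, 9, 10, 12, 13, 13, 13, 45])

mean = np.mean(data)
median = np.median(data)
mode = stats.mode(data, keepdims=False).mode   # most frequent value (SciPy >= 1.9)

data_range = data.max() - data.min()
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                  # interquartile range
sd = np.std(data, ddof=1)                      # sample standard deviation

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={data_range} IQR={iqr} SD={sd:.2f}")
```

Note how the single outlier pulls the mean well above the median, which is why the median is preferred for skewed data.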

Page 19: Basics of statistics

Shape
• Skewness- lack of symmetry in the distribution. It can be interpreted from the frequency polygon.
• Properties-
– Mean, median & mode fall at different points.
– Quartiles are not equidistant from the median.
– The curve is not symmetrical but stretched more to one side.
• A distribution may be positively or negatively skewed. The limits for the coefficient of skewness are ±3.
• Kurtosis- convexity of a curve.
– Gives an idea about the flatness/peakedness of the curve.
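A sketch of how these shape measures are usually computed with SciPy (the right-skewed data below are simulated purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
data = rng.exponential(scale=2.0, size=200)    # right-skewed data (simulated)

print("skewness:", round(stats.skew(data), 2))             # > 0 -> positively (right) skewed
print("excess kurtosis:", round(stats.kurtosis(data), 2))  # > 0 -> more peaked than a normal curve
```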

Page 20: Basics of statistics

Normal distribution
• Bell-shaped symmetric distribution.
• Why is it important?
– Many things are normally distributed, or very close to it.
– It is easy to work with mathematically.
– Most inferential statistical methods make use of properties of the normal distribution.
• Mean = Median = Mode
• 68.2% of the values lie within 1 SD of the mean.
• 95.4% of the values lie within 2 SD of the mean.
• 99.7% of the values lie within 3 SD of the mean.
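These percentages follow directly from the standard normal cumulative distribution function; a quick check with SciPy (a sketch, not part of the slides):

```python
from scipy.stats import norm

# probability mass within k standard deviations of the mean of a normal distribution
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} SD: {p:.1%}")
# prints approximately 68.3%, 95.4%, 99.7%
```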

Page 21: Basics of statistics

Tests to check normal distribution
1. Checking measures of central tendency, skewness & kurtosis.
2. Graphical evaluation- normal probability plot, frequency polygon.
3. Statistical tests-
– Kolmogorov-Smirnov test
– Shapiro-Wilk test
– Lilliefors test
– Pearson's chi-squared test
• The Shapiro-Wilk test has the best power for a given significance level.
• If the data are not normally distributed?- correct by transformation of the data, e.g. log transformation or square-root transformation.
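A sketch of how the Shapiro-Wilk test is typically run in SciPy (the data here are simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=100)   # simulated data for illustration

stat, p = stats.shapiro(sample)
print(f"Shapiro-Wilk W={stat:.3f}, p={p:.3f}")
if p > 0.05:
    print("No evidence against normality; parametric tests are reasonable.")
else:
    print("Data deviate from normality; consider a transformation or a non-parametric test.")
```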

Page 22: Basics of statistics

Hypothesis testing
• The aim of doing a study is to check whether the data agree with certain predictions. These predictions are called hypotheses.
• Hypotheses arise from the theory that drives the research.
• Significance test- a way of statistically testing a hypothesis by comparing it with the data values.
– It involves two hypotheses: the Null hypothesis (H0) & the Alternative hypothesis (H1).
– The null hypothesis is usually a statement that the parameter has a value corresponding to, in some sense, no effect.
– The alternative hypothesis is a hypothesis that contradicts the null hypothesis.
– Hypotheses are formulated before collecting the data.

Page 23: Basics of statistics

• A significance test analyzes the strength of the sample evidence against the null hypothesis.
• The test is conducted to investigate whether the data contradict the null hypothesis, suggesting that the alternative hypothesis is true.
• Test statistic- a statistic calculated from the sample data to test the null hypothesis.
• p-value- the probability, if H0 were true, of obtaining a test statistic at least as extreme as the one observed. The smaller the p-value, the more strongly the data contradict H0.
• When the p-value ≤ 0.05, the data sufficiently contradict H0.
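As an illustration of how a p-value is obtained from a test statistic, a sketch for an assumed two-sided z-test (the statistic value 2.1 is made up purely for illustration):

```python
from scipy.stats import norm

z = 2.1                                     # assumed observed test statistic (illustrative)
p_two_sided = 2 * (1 - norm.cdf(abs(z)))    # P(|Z| >= 2.1) if H0 were true
print(f"p-value = {p_two_sided:.3f}")       # ~0.036, so H0 is rejected at the 5% level
```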

Page 24: Basics of statistics

Types of error
• Type I / α error- rejecting a true null hypothesis.
– We may conclude that a difference is significant when in fact there is no real difference.
– The maximum probability of this error that we allow is called the level of significance; the p-value of a test is compared against it. Because a type I error is serious, the limit is kept low, usually 5% (p < 0.05).
• Type II / β error- accepting a false null hypothesis.
– We may conclude that a difference is not significant when in fact there is a real difference.
– Power of the test = 1 − β; power indicates the sensitivity of the test.
• It is not possible to reduce both type I & II errors at the same time, so the α error is fixed at a tolerable limit & the β error is minimized by increasing the sample size.

Page 25: Basics of statistics

Estimation of sample size
• Too small a sample- fails to detect clinically important effects (lack of power).
• Too large a sample- identifies differences which have no clinical relevance.
• Calculation is based on (formulas not included)-
– Estimation of a mean
– Estimation of proportions
– Comparison of two means
– Comparison of two proportions
• Checklist- level of significance, power, study design, statistical procedure.
• Minimum sample size required for statistical analysis- 50.
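The slides omit the formulas, but as one common example, the sample size per group for comparing two means is often taken as n = 2 × (z(1−α/2) + z(power))² × σ² / δ², where σ is the expected standard deviation and δ the smallest difference worth detecting. A sketch (all numbers below are assumptions for illustration, not from the slides):

```python
from scipy.stats import norm

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Sample size per group for comparing two means (normal-approximation formula)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided level of significance
    z_beta = norm.ppf(power)            # desired power
    return 2 * ((z_alpha + z_beta) ** 2) * sigma ** 2 / delta ** 2

# illustrative assumptions: SD = 10, clinically important difference = 5
print(round(n_per_group(sigma=10, delta=5)))   # ~63 per group
```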

Page 26: Basics of statistics

Basic theorem in statistics
• Central limit theorem-
– States that the distribution of the sum/average of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution.
• Why is this important?
– It is the basis of many statistical procedures.
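A quick simulation that illustrates the theorem (a sketch; the exponential distribution and the sample sizes are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# averages of n skewed (exponential) observations become approximately normal as n grows
n, repeats = 50, 5000
sample_means = rng.exponential(scale=2.0, size=(repeats, n)).mean(axis=1)

print("skewness of the sample means:", round(stats.skew(sample_means), 2))   # close to 0
print("Shapiro-Wilk p on 500 means:",
      round(stats.shapiro(sample_means[:500]).pvalue, 3))                    # usually > 0.05
```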

Page 27: Basics of statistics

Parametric tests
• Statistical tests that make assumptions about the parameters (defining properties) of the population distribution.
• Assumptions made are-
– The data follow a normal distribution, or
– The sample size is large enough for the Central limit theorem to lead to normality of the averages, or
– The data are not normal but can be transformed to normality.
• Some situations where data do not follow a normal distribution-
– The outcome is an ordinal variable.
– Presence of definite outliers.
– The outcome has clear limits of demarcation.

Page 28: Basics of statistics

Tests to be used

Scale type and permissible statistics-
• Nominal- Mode; Chi-square test
• Ordinal- Mode/Median
• Interval & Ratio- Mean, Standard deviation; t-test, ANOVA, Post hoc, Correlation, Regression

Types of t-test-
• One-sample t-test- compares the sample mean with the population mean.
• Independent t-test- compares the means of two independent samples.
• Dependent (paired) t-test- compares the means of paired samples (before-after, pre-post).
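A sketch of the three t-tests in SciPy (all data below are simulated purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(100, 15, 30)        # scores in group A (simulated)
group_b = rng.normal(108, 15, 30)        # scores in group B (simulated)
before = rng.normal(70, 8, 25)
after = before - rng.normal(3, 4, 25)    # paired measurements (simulated)

print(stats.ttest_1samp(group_a, popmean=100))   # one-sample: mean vs population mean
print(stats.ttest_ind(group_a, group_b))         # independent samples
print(stats.ttest_rel(before, after))            # paired/dependent samples
```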

Page 29: Basics of statistics

ANOVA
• t-test- tests the difference between 2 means.
– If there are more than 2 means, doing multiple t-tests inflates the type I (α) error, which is a serious flaw.
• So when there are >2 means to be compared we use ANOVA.
• Types-
– One way- studies the effect of one factor.
– Two way- studies the effects of two factors.
• Assumptions of ANOVA- Normality, Linearity.
• ANCOVA- a blend of ANOVA & regression: it compares group means while adjusting for the effect of one or more continuous covariates.
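A sketch of a one-way ANOVA in SciPy (three simulated groups, purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
g1 = rng.normal(20, 5, 25)
g2 = rng.normal(23, 5, 25)
g3 = rng.normal(27, 5, 25)

f_stat, p = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p = {p:.4f}")   # p < 0.05 suggests at least one group mean differs
```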

Page 30: Basics of statistics

Post Hoc
• A Latin phrase meaning "after this" or "after the event".
• Why do post hoc tests?
– ANOVA tells whether there is an overall difference between groups, but it does not tell which specific groups differed.
– Post hoc tests tell where the difference occurred between groups.
• Different post hoc tests-
– Bonferroni
– Fisher's least significant difference (LSD)
– Tukey's honestly significant difference (HSD)
– Scheffé's test
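For example, Tukey's HSD is available in recent SciPy versions as scipy.stats.tukey_hsd; a sketch with simulated groups (version-dependent, so treat this as an assumption about your SciPy install):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
g1 = rng.normal(20, 5, 25)
g2 = rng.normal(23, 5, 25)
g3 = rng.normal(27, 5, 25)

# pairwise comparisons with family-wise error control (requires a recent SciPy)
result = stats.tukey_hsd(g1, g2, g3)
print(result)   # table of pairwise mean differences with adjusted p-values and CIs
```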

Page 31: Basics of statistics

Correlation & Regression
• Correlation- denotes the association between 2 quantitative variables.
– It assumes the association is linear (i.e., one variable increases/decreases by a fixed amount for a unit increase/decrease in the other).
– The degree of association is measured by the correlation coefficient, r.
– r is measured on a scale from -1 through 0 to +1.
– When both variables increase together, r is positive; when one variable increases as the other decreases, r is negative.
• Graphically- scatter diagrams; usually the independent variable is plotted on the x-axis & the dependent variable on the y-axis.
• Limitation- correlation says nothing about a cause & effect relationship.
– Beware of spurious/nonsense correlations.
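A sketch of computing r with SciPy (x and y below are simulated and purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(170, 10, 50)           # e.g. height (simulated)
y = 0.9 * x + rng.normal(0, 8, 50)    # a variable linearly related to x (simulated)

r, p = stats.pearsonr(x, y)
print(f"r = {r:.2f}, p = {p:.4f}")    # r near +1 indicates a strong positive linear association
```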

Page 32: Basics of statistics

• Correlation-
– Measures the strength/degree of the association.
• Regression-
– Describes the nature of the association (e.g., if x & y are related, it tells by how much y changes on average when x changes by a certain amount).
– Expresses the linear relationship between the variables.
– Regression coefficient- β
– Types- Linear, Non-linear, Stepwise
• The regression coefficient gives a better summary of the relationship between the two variables than the correlation coefficient.
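A sketch of a simple linear regression with SciPy (simulated data; the slope is the regression coefficient β described above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(170, 10, 50)
y = 0.5 * x + 10 + rng.normal(0, 5, 50)   # simulated linear relationship

res = stats.linregress(x, y)
print(f"slope (beta) = {res.slope:.2f}, intercept = {res.intercept:.2f}, r = {res.rvalue:.2f}")
# the slope says: for every 1-unit increase in x, y changes on average by ~0.5 units
```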

Page 33: Basics of statistics

Non-parametric tests
• Also called "distribution-free tests", because they are based on fewer assumptions.
• Advantages-
– Can be used when the data do not follow a normal distribution.
– Can be used when the average is better represented by the median.
– Suitable when the sample size is small.
– Robust to the presence of outliers.
– Relatively simple to conduct.

Page 34: Basics of statistics

Summary of parametric tests and their non-parametric counterparts-
• Testing a mean against a hypothesized value- Parametric: One-sample t-test; Non-parametric: Sign test
• Comparison of the means of 2 groups- Parametric: Independent t-test; Non-parametric: Mann-Whitney U test
• Means of related samples- Parametric: Paired t-test; Non-parametric: Wilcoxon signed-rank test
• Comparison of the means of >2 groups- Parametric: ANOVA; Non-parametric: Kruskal-Wallis test
• Comparison of the means of >2 related groups- Parametric: Repeated-measures ANOVA; Non-parametric: Friedman's test
• Relationship between 2 quantitative variables- Parametric: Pearson's correlation; Non-parametric: Spearman's correlation
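The non-parametric counterparts are also available in SciPy; a sketch with simulated skewed data (purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.exponential(2.0, 30)           # skewed data (simulated)
group_b = rng.exponential(2.8, 30)
before = rng.exponential(3.0, 25)
after = before * rng.uniform(0.6, 1.1, 25)   # paired measurements (simulated)

print(stats.mannwhitneyu(group_a, group_b))      # 2 independent groups
print(stats.wilcoxon(before, after))             # 2 related samples
print(stats.kruskal(group_a, group_b, before))   # >2 independent groups
print(stats.spearmanr(before, after))            # rank correlation
```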

Page 35: Basics of statistics

Chi-square test
• Used for the analysis of categorical data.
• Other tests- Fisher's exact probability test, McNemar's test.
• Requirements of the Chi-square test-
– The samples should be independent.
– The sample size should be reasonably large (n > 40).
– Expected cell frequencies should not be < 5.
• Yates's correction- applied if an expected cell frequency is < 5.
• Fisher's exact probability test- used when the sample size is small (n < 20).
• McNemar's test- used when there are two related samples or repeated measurements.
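A sketch of a chi-square test of independence on a 2×2 table (the counts are invented for illustration); scipy.stats.fisher_exact takes the same table and is preferred when counts are small:

```python
from scipy.stats import chi2_contingency, fisher_exact

# rows: exposed / not exposed; columns: diseased / healthy (invented counts)
table = [[30, 70],
         [15, 85]]

chi2, p, dof, expected = chi2_contingency(table)   # Yates's correction applied by default for 2x2
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")

odds_ratio, p_exact = fisher_exact(table)          # exact test for small samples
print(f"Fisher exact p = {p_exact:.4f}")
```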

Page 36: Basics of statistics

RR & OR
• Relative Risk (RR)-
– The ratio of the incidence rate among the exposed to the incidence rate among the unexposed.
– Used in RCTs & cohort studies.
– Values- <1: risk of disease is less among the exposed; >1: risk of disease is more among the exposed; =1: equal risk among exposed & unexposed.
• Odds Ratio (OR)-
– The ratio of the odds of exposure among the cases to the odds of exposure among the controls. Used for rare diseases/events.
– Used in case-control & retrospective studies (where there is no meaning in calculating the risk of getting the disease).
– Values- >1: exposure more common among cases; <1: exposure more common among controls.
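From a standard 2×2 table with cells a (exposed, diseased), b (exposed, healthy), c (unexposed, diseased) and d (unexposed, healthy), RR = [a/(a+b)] / [c/(c+d)] and OR = (a×d)/(b×c). A sketch with invented counts:

```python
# 2x2 table (invented counts): rows = exposed / unexposed, columns = diseased / healthy
a, b = 30, 70    # exposed:   diseased, healthy
c, d = 15, 85    # unexposed: diseased, healthy

rr = (a / (a + b)) / (c / (c + d))   # relative risk (cohort/RCT designs)
odds_ratio = (a * d) / (b * c)       # odds ratio (case-control designs)
print(f"RR = {rr:.2f}, OR = {odds_ratio:.2f}")   # both > 1: risk/odds higher among the exposed
```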

Page 37: Basics of statistics

Qualitative v/s Quantitative

Quantitative research
• Seeks to confirm hypotheses
• Highly structured methods used
• Uses closed-ended, numerical methods of collecting data
• Study design is fixed & subject to statistical assumptions

Qualitative research
• Seeks to explore phenomena
• Semi-structured methods used
• Uses open-ended, textual methods
• Study design is flexible, iterative & subject to textual analysis

Page 38: Basics of statistics

Qualitative research
• Provides complex descriptions & information about issues such as contradictory behaviour, beliefs, opinions, emotions & relationships.
• Methods used-
– Phenomenology
– Ethnography
– Grounded theory
• Designs used-
– Case studies
– Comparative designs
– Snapshots
– Retrospective & longitudinal studies

Page 39: Basics of statistics

Statistical software packages

Quantitative research
• SPSS by IBM
• R by the R Foundation
• GenStat by VSN International
• Mathematica by Wolfram Research
• Minitab, MATLAB, NMath Stats etc.

Qualitative research
• ATLAS.ti
• NVivo
• MAXQDA
• NUD*IST
• ANTHROPAC

Page 40: Basics of statistics

Thank you

"An approximate answer to the right problem is worth a good deal, more than an exact answer to an approximate problem." -- John Tukey