mereesha-k-noushad
BIOSTATISTICS
Statistics deals with
• Planning Research
• Collecting Data
• Describing Data
• Summarizing- Presenting Data
• Analyzing Data
• Interpreting Results
• Reaching decisions or discovering new knowledge
Biostatistics is the application of statistical methods to health sciences.
DEFINITION:
It is the method of collecting, organizing, analysing, tabulating and interpreting data related to living organisms and human beings. [Soben Peter]
History
John Graunt (1620-1674), who was neither a physician nor a mathematician, is regarded as the father of health statistics. The term biometry was coined by W.F.R. Weldon (1860-1906), a zoologist at University College, London.
Use of biostatistics
• To test whether the difference between two populations is real or a chance occurrence.
• To study the correlation between attributes in the same population.
• To evaluate the effect of vaccines, sera, etc.
• To measure mortality and morbidity.
• To evaluate achievements of public health programs.
• To fix priorities in public health programs.
• To help promote health legislation and create administrative standards for oral health.
Aims of biostatistics
• To generate statistical data through experimental investigation and sample surveys.
• To organize and represent the data in suitable tables, diagrams, charts or graphs.
• To draw valid inferences from the data collected, put forth definite interpretations or predict future outcomes from the data.
Why should medical/Dental students learn biostatistics?
1. Medicine is becoming increasingly QUANTITATIVE.
• The aim is to improve the Health Status of the population.
• We have to clarify the relationships between certain factors and diseases.
• Enumerate the occurrences of diseases
• Explain the etiology of diseases (which factors cause which diseases)
• Predict the number of disease occurrences
• Read, understand and criticize the medical literature.
2. The planning, conduct and interpretation of much of medical research are becoming increasingly reliant on statistical methods.
Planning:
How many patients must be treated?
How should the subjects be allocated to treatments?
What other factors may influence the response variable?
Conduct:
Under which conditions must the study be conducted?
Is matching necessary?
Is blinding (single blinding or double blinding) necessary?
Is there a need for a control group?
Should the placebo effect be considered?
Which experimental design technique is more appropriate?
Interpretation:
Example:
Distribution of Women with a Diagnosis of Thromboembolism Among Blood Groups
Blood Group   Frequency   %
A             32          58.2
AB            4           7.3
B             8           14.5
O             11          20.0
Total         55          100.0
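The percentage column of the table can be reproduced from the frequency column. The following is a minimal sketch; the variable names are illustrative and not part of the original notes.

```python
# Frequencies taken from the table above
# (women with a diagnosis of thromboembolism, by blood group).
frequencies = {"A": 32, "AB": 4, "B": 8, "O": 11}

total = sum(frequencies.values())

# Percentage of each group = frequency / total * 100
percentages = {group: 100 * count / total for group, count in frequencies.items()}

for group, pct in percentages.items():
    print(f"{group:>2}: {frequencies[group]:>2}  {pct:.1f}%")
```

Reading the result: more than half of the cases fall in blood group A, which is the kind of statement the interpretation step is meant to support.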
Terminologies:
Data : Set of values of one or more variables recorded on one or more observational units.
Observation (case): Individual source of data.
Variable: This is a quantity which varies such that it may take any one of a specified set of values. It may be measurable or non-measurable.
Population: A collection, or set, of individuals, objects, or measurements whose properties are to be analyzed.
Sample: A subset of the population, selected in such a way that it is representative of the larger population.
Parameter : A summary value which in some way characterizes the nature of the population in the variable under study.
Statistic : A summary value calculated from a sample of observation.
DATA:
Sources of data
1. Routinely kept records
2. Published data sources
3. Data on electronic media
4. Surveys and Experimental research
5. Census
6. Generated or artificial data
Types of Data
1. Qualitative Data
Results from a variable that asks for a quality type of description of the subject.
2. Quantitative Data
Results from obtaining quantities-counts or measurements.
Types of biostatistics
• Descriptive biostatistics
• Inferential biostatistics
Descriptive biostatistics:
It is the study of biostatistical procedures which deal with the collection, representation, calculation and processing, i.e., the summarization of the data to make it more informative and comprehensible. The primary function of descriptive statistics is to provide meaningful and convenient techniques for describing features of data that are of interest. The failure to choose appropriate descriptive statistics often leads to faulty scientific inference. The field of descriptive statistics is not concerned with the implications or conclusions that can be drawn from the sets of data.
Inferential biostatistics:
It comprises the procedures used to generalize or draw conclusions about a population on the basis of the study of a sample. This is also known as sampling biostatistics. The study of the quantitative aspects of the inferential process provides a solid basis on which the more general substantive process of inference can be founded.
Basis for statistical analyses
Statistical analyses are based on three primary entities:
• The population 'U' that is of interest
• The set of characteristics (variables) 'V' of the units of this population
• The probability distribution 'P' of these characteristics in the population
The population 'U'
The population is a collection of units of observation that are of interest and is the target of the investigation. For example, in determining the effectiveness of a particular drug for a disease, the population would consist of all possible patients with the disease. It is essential, in any research study, to identify the population clearly and precisely. The success of the investigation will depend to a large extent on the identification of the population of interest.
The variables 'V'
A variable is a state, condition, concept or event whose value is free to vary within the population. Once the population is identified, we should clearly define which characteristics of the units of this population (subjects of the study) we are planning to investigate.
For example in the case of a particular drug, one needs to define the disease and what other characteristics of the people (e.g. age, sex, education, etc.) one intends to study.
Clear and precise definitions and methods for measuring these characteristics (a simple observation, a laboratory measurement, or tests using a questionnaire) are essential for the success of the research study.
Variables can be classified as ,
Independent variables: variables that are manipulated or treated in a study in order to see what effect, differences in them will have on those variables proposed as being dependent on them. Synonyms: cause, input, predisposing factor, antecedent, risk factor, characteristic, attribute, determinant.
Dependent variables: variables in which changes are results of the level or amount of the independent variable or variables. Synonyms: effect, outcome, consequence, result, condition, disease.
Confounding or intervening variables: variables that should be studied because they may influence or 'confound' the effect of the independent variables on the dependent variables. E.g. the study of tobacco (independent variable) on oral cancer (dependent variable), the nutritional status of the individual may play an intervening role.
Background variables: variables that are so often of relevance in investigations of the groups or populations that they should be considered for possible inclusion in the study. Synonyms: sex, age, ethnic origin, education, marital status, social status.
The probability distribution 'P'
The probability distribution is a way to enumerate the different values the variable can have, and how frequently each value appears in the population. The actual frequency distribution is approximated by a theoretical curve that is used as the probability distribution. Common examples of probability distributions are the binomial and the normal.
For example, the incidence of a relatively common illness may be approximated by a binomial distribution, whereas the distributions of continuous variables (blood pressure, heart rate) are often considered to be normally distributed.
Probability distributions are characterized by parameters, i.e., quantities that allow us to calculate the probabilities of various events concerning the variable, or that allow us to determine the value of probability for a particular value. The binomial distribution has two parameters: the number of subjects observed (n) and the probability (p). It occurs when a fixed number of subjects are observed, the characteristic is dichotomous in nature (only two possible values), and each subject has the same probability (p) of having one value and (1-p) of having the other value.
The normal distribution, on the other hand, is a mathematical curve characterized by two quantities, μ and σ. The former represents the mean of the values of the variable, and the latter the standard deviation. The type of statistical analysis done depends on the design of the study.
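To make the binomial description concrete, the probability of observing exactly k subjects with the characteristic among n subjects can be computed directly. This is a minimal sketch; the values n = 10 and p = 0.2 are hypothetical and chosen only for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability that exactly k of n subjects show the characteristic,
    when each subject independently has probability p of showing it."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.2  # hypothetical: 10 subjects, each with a 20% chance

# Probabilities for k = 0, 1, ..., n; together they must add up to 1.
probabilities = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(probabilities):
    print(f"P(X = {k}) = {prob:.4f}")
```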
Collection of data
In scientific research work, data is collected only from the investigator's own experimental study, i.e., primary data is used. Statistical data can be collected in two ways:
1. Census method
2. Sampling method
Census method
In this method the data is collected from all the individual items that are connected with the inquiry.
Advantages of census method
i. The data has a high degree of accuracy.
ii. The data is more representative and true.
iii. Results are more reliable.
iv. Possibility of bias is minimised.
Disadvantages of census method
i. It is less economical as it consumes more time, more energy and more expenditure.
ii. It requires organizational skills and a large number of investigators.
iii. It cannot be applied to all situations, e.g., to determine the blood cell count it is not possible to analyse the whole blood.
Sampling method
In this method the data is collected from a small group of the population, which is termed a sample. A sample is a portion of the population selected to represent the population.
Types of samples
There are two types of samples which are used in biostatistics:
1. Qualitative samples: when we say that children from African population are taller than those in India it is called as qualitative sample.
2. Quantitative samples: when we try to know the number of decayed teeth of individuals of particular age group then it is called quantitative sample.
Size of samples
The total number of units used in the study to get significant results is termed the sample size. Selecting the proper sample size is very important: the sample should be neither too small nor too large, because the conclusions are directly affected by it.
Advantages of the sampling method
• This method is comparatively more economical as it consumes less energy, less time and less expenditure.
• It requires fewer investigators.
• It is best suited to those places and situations where the census method cannot be applied.
Disadvantages of sampling method
• It requires the services of experts, otherwise incorrect or misleading results will be obtained.
• In this method selection of an appropriate method of sampling is necessary.
If the population is very small and we need precise information, then the census method is preferred. If the population is very large, or the field of investigation is very wide and quick results are required, sampling methods should be used.
Types of sampling methods
There are two types of sampling techniques:
a. Random or probability sampling
1. Simple randomized sampling
2. Stratified randomized sampling
3. Systematic sampling
4. Cluster sampling
5. Multistage sampling
b. Nonrandom or nonprobability sampling
1. Convenience sampling
2. Purposive sampling
3. Quota sampling
Random or probability sampling
In random sampling a sample is selected in such a way that every element in the population has an equal opportunity of being included in the sample. It means random sampling is made without deliberate discrimination. Random sampling is carried out to ascertain a particular character of the population. It involves unbiased or non-preferential samples.
Selection of random samples
Sampling without replacement
In this type of sampling an observation can be included only once and is selected randomly without any preference or conscious effort.
Sampling with replacement
In this type of sampling each observation is returned to the population after being drawn, so it has a chance of being selected at every draw.
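The difference between the two selection schemes can be sketched with Python's standard `random` module. The population of 20 numbered units and the fixed seed are assumptions made only so that the illustration is repeatable.

```python
import random

rng = random.Random(42)          # fixed seed, only so the sketch is repeatable
population = list(range(1, 21))  # hypothetical population of 20 numbered units

# Without replacement: a unit can be included only once.
without_replacement = rng.sample(population, 5)

# With replacement: every unit is available again at each draw,
# so the same unit may appear more than once.
with_replacement = rng.choices(population, k=5)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```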
Properties of random samples:
• Several samples drawn from the same population will differ, i.e., their statistical characteristics will change from sample to sample.
• A random sample should be large, because the larger the sample, the smaller the variation of its characteristics from one random sample to another.
• A random sample must be selected in such a way that every element in the population has an equal opportunity of being included in the sample.
Advantages of random sampling
The main advantages of random sampling are:
1. The random sampling enables the researcher to draw inferences about the whole population.
2. It eliminates personal bias. The researcher cannot reject those observations which do not support his theory. Similarly, the researcher cannot select only those observations which may support his theory.
Types of random sampling methods
1. Simple randomized sampling
2. Stratified randomized sampling
3. Systematic sampling
4. Cluster sampling
5. Multistage sampling
Simple random sampling
In this method samples are chosen at random and each member or sample unit of the population has an equal chance of being selected in the sample. This method is well suited when the population is small, homogeneous and readily available. This method is sometimes called unrestricted random sampling.
• Every possible sample of a certain size within a population has a known and equal probability of being chosen.
• Most basic type of probability sampling.
• Actual selection is done by randomly picking the desired number of units from the population.
• Statistically equivalent to:
  o Identifying all possible samples of the desired size
  o Picking one of those samples at random
Stratified random sampling
Samples are chosen at random from different strata, usually of different sizes, of a population, based on a priori information about the variation and site. Stratified random sampling is done in heterogeneous populations, i.e., this procedure is followed when the population is not homogeneous. A heterogeneous population is divided into several more or less homogeneous sections or groups. These are called strata. A sample is drawn from each stratum by simple random sampling. Thus the variability in each stratum is adequately represented in the sample.
• Probability sampling procedure
• The chosen sample is forced to contain units from each of the segments, or strata, of the population
• Also known as proportional or quota random sampling; it involves:
  o Dividing the population into homogeneous subgroups
  o Taking a simple random sample in each subgroup
• Statistically more efficient
• Provides a more accurate estimate of population variables
• Two types of stratified random sampling:
  o proportionate
  o disproportionate
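A proportionate stratified sample can be sketched as follows. The function name, the two strata ("urban", "rural") and their sizes are hypothetical; the sketch simply takes a simple random sample from each stratum in proportion to the stratum's share of the population.

```python
import random

def proportionate_stratified_sample(strata, sample_size, seed=0):
    """Draw a simple random sample from each stratum, sized in proportion
    to the stratum's share of the whole population."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        # Rounding can make the overall total differ slightly from
        # sample_size when the shares are not whole numbers.
        n_stratum = round(sample_size * len(units) / total)
        sample[name] = rng.sample(units, n_stratum)
    return sample

# Hypothetical population: 60 urban and 40 rural subjects.
strata = {
    "urban": [f"U{i}" for i in range(60)],
    "rural": [f"R{i}" for i in range(40)],
}

sample = proportionate_stratified_sample(strata, sample_size=10)
print({name: len(units) for name, units in sample.items()})
```

A disproportionate design would replace the proportional `n_stratum` with a number fixed separately for each stratum.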
Systematic sampling
This is a simple procedure and utilized when a complete list of population from which a sample is to be drawn is available. It is more often applied to field studies when population is large, scattered and heterogeneous. In this sampling method, samples are drawn evenly spaced after a random start position A is chosen. From a large population, samples are selected every 10th, 20th, 25th or 50th item.
Cluster sampling
In this method the population is divided into separate natural groups of elements. These groups are called clusters. Each cluster includes only one type of elements. A simple random sample of clusters is then taken. A cluster may consist of units such as villages, wards, blocks, factories, slums of a town, children of a school, etc.
Generally the clusters are natural groupings and if they are geographic regions, the sampling is called 'area sampling'.
• Probability sampling procedure
• Clusters of population units are selected at random
• All or some units in the chosen clusters are studied.
• When an adequate sampling frame of individual population units is not readily available, cluster sampling is helpful.
• Even when such a sampling frame is available, if the frame can be conveniently divided into a series of representative clusters, a cluster sampling approach may be easier to use than a simple or stratified random-sampling approach.
In cluster sampling, we follow these steps:
• Divide the population into clusters (usually along geographic boundaries)
• Randomly sample clusters
• Measure all units within the sampled clusters
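The three steps of cluster sampling can be sketched as below. The village names and unit labels are hypothetical; the point is that randomness is applied to whole clusters, and every unit in a chosen cluster is then measured.

```python
import random

rng = random.Random(1)  # fixed seed, only so the sketch is repeatable

# Step 1: the population divided into clusters (hypothetical villages).
clusters = {
    "village_A": ["A1", "A2", "A3"],
    "village_B": ["B1", "B2"],
    "village_C": ["C1", "C2", "C3", "C4"],
    "village_D": ["D1", "D2", "D3"],
}

# Step 2: randomly sample clusters, not individual units.
chosen_clusters = rng.sample(sorted(clusters), 2)

# Step 3: measure all units within the sampled clusters.
measured_units = [unit for name in chosen_clusters for unit in clusters[name]]

print("chosen clusters:", chosen_clusters)
print("units measured: ", measured_units)
```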
Multistage sampling
In multistage sampling, clusters are selected in a primary cluster sample, and these clusters are then sampled again instead of being fully inspected. This procedure is employed in large-scale, country-wide surveys.
Systematic sampling
• The researcher selects the first unit randomly
• The remaining units are selected systematically
Steps:
• Number the units in the population from 1 to N
• Decide on the n (sample size) that you want or need
• k = N/n = the interval size
• Randomly select an integer between 1 and k
• Take every kth unit
Advantages:
• Simplicity relative to the other methods
• Requires only one random number to select a sample
• Statistical efficiency is practically equivalent
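The systematic sampling steps (number the units 1 to N, compute k = N/n, pick a random start between 1 and k, take every kth unit) can be sketched as follows. The function name and the values N = 100, n = 10 are hypothetical, and the sketch assumes the integer part of N/n is used as the interval.

```python
import random

def systematic_sample(N, n, seed=None):
    """Return the unit numbers (1..N) chosen by systematic sampling."""
    rng = random.Random(seed)
    k = N // n                 # interval size (integer part of N/n)
    start = rng.randint(1, k)  # the single random number: an integer from 1 to k
    return [start + i * k for i in range(n)]

# Hypothetical population of 100 numbered units, sample of 10.
selected = systematic_sample(N=100, n=10, seed=3)
print(selected)
```

Only the starting point is random; every later unit follows from it, which is why a single random number is enough.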
Practical Considerations: Probability Sampling Methods
• Not all methods may be equally practical in any project.
• Base the choice upon:
  o Nature of the population
  o Degree of precision desired
  o Resources available for research
Nonrandom or nonprobability sampling
• Subjective procedure
• Probability of selection for the population units cannot be determined.
• The selection is not done on a strictly chance basis
• Offers researchers greater freedom and flexibility in sampling.
• Nonprobability samples:
  o Cannot depend upon the rationale of probability theory
  o May or may not represent the population well
  o Make it difficult to know how well we have done so
• There may be circumstances where random sampling is not feasible, practical or theoretically sensible.
In nonrandom sampling, the samples are drawn without following any criteria or yardstick. The sample collected does not follow any specific approach, nor can it be used to properly assess the accuracy of the estimator. In this sampling procedure many investigator biases are likely to occur. The main types are:
1. Accidental, Haphazard or Convenience Sampling: this is also known as accidental, accessibility or haphazard sampling. The major reason for using it is administrative convenience. The sample is chosen with ease of access being the sole concern.
• A researcher's convenience forms the basis for selecting a sample of units
• Very popular in online research, where it is known as intercept sampling or pop-up surveys.
• Traditional "man on the street" interviews conducted frequently by television news programs
• Use of college students in much psychological research is primarily a matter of convenience.
• Many research projects simply ask for volunteers.
2. Judgement/Purposive sampling: this is also known as judgemental sampling. The experimenter exercises deliberate subjective choice in drawing what is regarded as a representative sample. Judgemental sampling aims at elimination of anticipated sources of distortion, but there will always remain the risk of distortion due to personal prejudices or lack of knowledge of certain crucial features in the structure of the population.
• The researcher exerts some effort in selecting a sample that is believed to be most appropriate.
  o The researcher will usually be knowledgeable about the nature of the ideal population.
  o Requires greater researcher effort
  o Generally more appropriate than a convenience sample.
• Can be very useful:
  o when you need to reach a targeted sample quickly
  o when sampling for proportionality is not the primary concern.
• Likely to yield the opinions of your target population
• Likely to overweight more readily accessible subgroups.
3. Quota sampling: this combines convenience and judgement and is more structured than either of the two. Quota sampling needs a proper statistical design to determine what numbers are needed in each of the quotas.
• Sampling a quota of units, selected from each population cell, based on judgment
• Most refined form of nonprobability sampling
• Often used in practice, especially in personal interviewing.
• Resembles stratified random sampling
• Has features of judgment and convenience sampling as well.
• Select people nonrandomly according to some fixed quota.
Proportional quota sampling
• Represent the major characteristics of the population
• Sample a proportional amount of each.
• Example:
  o Population has 40% women and 60% men
  o Required sample size of 100
  o Continue sampling until achieving the percentages, then stop.
Nonproportional quota sampling
• Specify the minimum number of sampled units you want in each category.
• Not concerned with numbers that match the proportions in the population.
• Simply need enough to assure the ability to talk about even small groups in the population.
• Nonprobabilistic analogue of stratified random sampling
• Typically used to assure that smaller groups are adequately represented
4. Heterogeneity Sampling, also known as Sampling for diversity
• To include all opinions or views
• Not concerned about representing these views proportionately.
• Obtain a broad spectrum of ideas
• Not identifying the "average" or "modal instance" ones.
• Sampling ideas, not people.
• To get all of the ideas (especially the unusual ones)
• Include a broad and diverse range of participants.
5. Snowball Sampling
• Identify someone who meets the criteria for inclusion
• Ask them to recommend others they may know who also meet the criteria.
• Useful when trying to reach populations that are inaccessible or hard to find.
Sampling error
• The difference between a statistic value generated through sampling and the parameter value, which can be determined only through a census study
• The magnitude of the sampling error says how precisely the population parameter can be estimated from a sample value
• Estimate the average amount of sampling error associated with a given sampling procedure.
• The true population parameter value is unknown
• The sample statistic value may vary from sample to sample within the population
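Sampling error can be illustrated by simulation. In practice the population parameter is unknown; the sketch below assumes an artificial population (10,000 simulated values with mean near 120 and standard deviation near 15, all hypothetical) so that the parameter can be computed and compared with sample statistics.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed, only so the sketch is repeatable

# Artificial population; in a real study the parameter would be unknown.
population = [rng.gauss(120, 15) for _ in range(10_000)]
parameter = mean(population)  # the value a census would give

def sampling_errors(sample_size, repeats=200):
    """Sampling error (statistic minus parameter) for repeated random samples."""
    return [
        mean(rng.sample(population, sample_size)) - parameter
        for _ in range(repeats)
    ]

# Average size of the sampling error for small and large samples.
avg_abs_error_small = mean(abs(e) for e in sampling_errors(10))
avg_abs_error_large = mean(abs(e) for e in sampling_errors(400))

print(f"average |error|, n = 10:  {avg_abs_error_small:.2f}")
print(f"average |error|, n = 400: {avg_abs_error_large:.2f}")
```

The statistic varies from sample to sample, and the average error shrinks as the sample grows.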
PRESENTATION OF DATA
Objective of classification of data:
make the data simple,
concise, meaningful,
interesting and
helpful in further analysis.
two main methods of presenting data:
Tabulation and
Diagrams
TABULATION
classified on the following bases:
Geographical, i.e., area-wise, e.g., cities, districts, etc.
Chronological, i.e., on the basis of time.
Qualitative, i.e., according to some attribute.
Quantitative, i.e., in terms of magnitude.
The two elements of classification are
The variable and
The frequency.
Variable: a name denoting a condition , occurrence or effect that can assume different values
Divided: subgroups ,classes.
have lowest and highest values
Class interval : difference between the upper and lower limit of a class
Eg: in the class 5 -14,
5 - lower limit and 14 - upper limit.
class interval = 14 - 5 =9.
Frequency: is the number of units belonging to each group of the variable.
Frequency distribution table: way of presenting data in the tables
Frequency distribution table
• Title of the table – named at the bottom
• The no of class intervals - between 5 and 20. no rigidity about it.
• The class intervals - at equal width.
• Clearly defined class limits – to avoid ambiguity.
E.g., 0-4, 5-9, 10-14, etc.
• Clearly defined row and column with the headings
• Units of measurement should be specified.
• If the data is not original, the source of the data should be mentioned at the bottom of the table.
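A frequency distribution table with equal-width, clearly defined classes (0-4, 5-9, 10-14) can be built as in this minimal sketch. The list of ages is hypothetical data invented for the illustration.

```python
from collections import Counter

# Hypothetical raw data: ages in years of 15 children.
ages = [3, 7, 12, 5, 9, 14, 1, 8, 11, 6, 4, 10, 13, 2, 7]
width = 5  # equal class width: 0-4, 5-9, 10-14

# Map every observation to the lower limit of its class and count.
counts = Counter((age // width) * width for age in ages)

# One row per class: (class limits, frequency).
table = [(f"{lower}-{lower + width - 1}", counts[lower]) for lower in sorted(counts)]

print("Age (years)  Frequency")
for limits, frequency in table:
    print(f"{limits:<12} {frequency}")
```

Because the class limits do not overlap, each observation falls into exactly one class.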
Diagrams:
Extremely useful
attractive to the eyes,
give a bird's eye view of the entire data,
have a lasting impression
TYPES OF DIAGRAMS:
Bar Diagram : qualitative data.
Multiple Bar: qualitative data
Component Bar Diagram: qualitative data.
Proportional Bar Diagram
Histogram: quantitative data of continuous type.
Frequency Polygon: quantitative data
Pie Diagram: qualitative data
Line diagram: qualitative data
Cartograms or Spot Map: geographical distribution of frequencies
Basic rules :
Self explanatory
Simple and consistent with the data.
Values of the variables - on horizontal or X-axis and the frequency - vertical line or Y-axis.
Not too many lines on the graph; it should not look clumsy.
The scale of presentation – right hand top corner of the graph.
The scale of division of the two axes should be proportional.
The details of the variables and frequencies presented on the axes.
Bar Diagram
Represent qualitative data.
Only one variable.
width of the bar remains the same
The length varies according to the frequency in each category.
Bars: vertical or horizontal.
Limitation:
represent only one classification
cannot be used for comparison
Facilitate comparison of data relating to different time periods and regions.
Multiple Bar:
compare qualitative data with respect to a single variable.
Eg: sex wise or with respect to time or region.
each category of the variable has a set of bars of the same width,
one for each section, drawn without any gap between them;
the length of each bar corresponds to the frequency.
Component Bar Diagram:
represent qualitative data.
both, the number of cases in major groups as well as the subgroups simultaneously
a rectangle is drawn for the number of cases in each major group
each rectangle is divided according to the number in the subgroups.
Proportional Bar Diagram:
represent qualitative data.
compare only the proportion of sub-groups between different major groups of observations, then bars are drawn for each group with the same length, either as 1 or 100%. These are then divided according to the sub-group proportion in each major group.
PIE DIAGRAM
The frequency of the group is shown in a circle.
Degree of angle denotes the frequency.
Instead of comparing the length of bar , the areas of segments are compared.
Line diagram:
useful to study changes of values in the variable over time
simplest type
X-axis, - hours, days, weeks, months or years
Y-axis- value of any quantity pertaining to X-axis,
Histogram
quantitative data of continuous type.
bar diagram without gap between the bars.
represents a frequency distribution.
X-axis: the size of an observation is marked. Starting from 0 the limit of each class interval is marked, the width corresponding to the width of the class interval in the frequency distribution.
Y-axis :the frequencies are marked. A rectangle is drawn above each class interval with height proportional to the frequency of that interval.
Frequency Polygon
frequency distribution of quantitative data
compare two or more frequency distributions.
a point is marked over the mid-point of the class interval, corresponding to the frequency.
points are connected by straight lines.
The first point and last point are joined to the midpoint of previous and next class respectively.
SCATTER DIAGRAM
Cartograms or Spot Map
show geographical distribution of frequencies of a characteristic.
PICTOGRAM
The pictures representing the value of items are called pictograms.
It is the most useful way of presenting data to people who cannot otherwise understand it.
Measures of central tendency: a single estimate of a series of data that summarizes the data is known as a parameter, and one such parameter is the measure of central tendency.
Objective:
to condense the entire mass of data
to facilitate comparison
Fig. Height and Weight of 20 students of CODS (X-axis: height in feet; Y-axis: weight in kg)
Types:
Arithmetic mean- mathematical estimate.
Median - positional estimate.
Mode- based on frequency.
Properties of central tendency:
should be based on each and every item in the series.
should not be affected by extreme observations (either too small or too large values).
should be capable of further statistical computations.
It should have sampling stability, i.e., if different samples of the same size are picked from the same population and the measure of central tendency is calculated for each, the results should not differ markedly from each other.
Arithmetic Mean:
simplest measure of central tendency.
Ungrouped data:
Mean = (Sum of all the observations of the data) / (Number of observations in the data)
1. Grouped data with range for class interval:
frequencies in a class interval are equally distributed on either side of the mid point of the class interval.
The formula :
$$\bar{X} = \frac{\sum X_i f_i}{\sum f_i}$$
Where,
$\bar{X}$ : mean
$X_i$ : midpoint of the class interval
$f_i$ : corresponding frequency
2. Grouped data with single value for class interval:
Symbolically,
$$\bar{X} = \frac{\sum X_i f_i}{\sum f_i}$$
Where,
$X_i$ : value of the grouped variable,
$f_i$ : corresponding frequency
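The grouped-data mean can be checked with a short calculation. The class intervals and frequencies below are hypothetical; each class is represented by its midpoint, as the grouped-data formula assumes.

```python
# Hypothetical frequency distribution with ranges for class intervals.
class_intervals = [(0, 4), (5, 9), (10, 14)]
frequencies = [4, 6, 5]

# X_i: midpoint of each class interval -> 2.0, 7.0, 12.0
midpoints = [(lower + upper) / 2 for lower, upper in class_intervals]

# Mean = sum(X_i * f_i) / sum(f_i)
grouped_mean = sum(x * f for x, f in zip(midpoints, frequencies)) / sum(frequencies)

print(f"grouped mean = {grouped_mean:.2f}")
```

For grouped data with a single value per class, the same line applies with the single values in place of the midpoints.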
MEDIAN
middle value in a distribution such that one half of the units in the distribution have a value smaller than or equal to the median and one half have a value higher than or equal to the median.
Calculation of Median:
Ungrouped Data:
Observations are arranged in order of magnitude; the middle value of the observations is the median.
Odd number of observations: the value at position (n + 1) / 2
Even: the mean of the two middle values
Grouped: the value at position (total number of observations) / 2
For the arithmetic mean of ungrouped data, symbolically:
$$\bar{X} = \frac{\sum X_i}{n}$$
$\sum$ : sigma, means the sum of,
$X_i$ : the value of each observation in the data,
$n$ : the number of observations in the data.
MODE
value in a series of observations which occurs with the greatest frequency.
Eg: series on age at eruption of the canine as 6,6,5,7, 8, 6, 7, 5;
6 - mode.
Ill defined mode :
Mode = 3 Median - 2 mean.
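Using the age-at-eruption series given above (6, 6, 5, 7, 8, 6, 7, 5), the three measures can be computed with Python's standard `statistics` module. The last line applies the empirical relation Mode = 3 Median - 2 Mean purely for comparison; it is an approximation and need not equal the observed mode.

```python
from statistics import mean, median, mode

# Series from the text: age at eruption of the canine.
ages = [6, 6, 5, 7, 8, 6, 7, 5]

mean_age = mean(ages)      # sum of observations / number of observations
median_age = median(ages)  # even n: mean of the two middle values
mode_age = mode(ages)      # value occurring with the greatest frequency

# Empirical approximation, meant for series where the mode is ill defined.
empirical_mode = 3 * median_age - 2 * mean_age

print(mean_age, median_age, mode_age, empirical_mode)
```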
Variability & its measures
Types –
Biological variability
Real variability
Experimental variability
Biological variability
Normal or natural differences within accepted biological limits
Individual variability
Periodical variability
Class , group or category variability
Real variability
When the difference b/w two readings is more than the defined limits
Due to the external factors
Experimental variability
Errors or variations due to materials & methods
Observer error – Subjective error
Objective error
Instrument error
Sampling error
Measures of variability
Synonyms:
Measures of dispersion
Measures of variation or scatter
Dispersion is the degree of spread or variation of the variable about a central value.
Uses:
Determine reliability of an average
Serve as a basis of control of variability
Comparison of two or more series
Facilitate further statistical analysis
A good measure of dispersion : simple , easy to compute , based on all items , amenable for further analysis and not affected by extreme values.
Of individual observations -
Range
Interquartile range
Mean deviation
Standard deviation
Coefficient of variation
Variability of samples-
Standard error of mean
Standard error of difference b/w 2 means
Standard error of proportion
Difference b/w 2 proportions
Standard error of correlation coefficient
Standard deviation of regression coefficient
Range
Difference between the value of the smallest item and the value of the largest item.
simplest method.
gives no information about the values that lie between the extreme values.
subjected to fluctuations from sample to sample.
Mean deviation
The average of the absolute deviations from the arithmetic mean
$$\text{M.D.} = \frac{\sum |x - \bar{x}|}{n}$$
52,44,54,56,60,64,66,76,60,68
41,54,43,45,60,75,77,66,79,60
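The two series listed above appear to be example data for comparing dispersion. Assuming that is their purpose, the sketch below computes the range and the mean deviation of each; the function names are illustrative.

```python
from statistics import mean

# The two series as listed in the text.
series_1 = [52, 44, 54, 56, 60, 64, 66, 76, 60, 68]
series_2 = [41, 54, 43, 45, 60, 75, 77, 66, 79, 60]

def value_range(values):
    """Difference between the largest and the smallest item."""
    return max(values) - min(values)

def mean_deviation(values):
    """Average of the absolute deviations from the arithmetic mean."""
    centre = mean(values)
    return sum(abs(x - centre) for x in values) / len(values)

for name, series in (("series 1", series_1), ("series 2", series_2)):
    print(name, "mean:", mean(series),
          "range:", value_range(series),
          "mean deviation:", mean_deviation(series))
```

Both series have the same mean, so the mean alone cannot distinguish them; the second series is more widely scattered on both measures.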
Standard deviation:
most important and widely used
it is the square root of the mean of the squared deviations from arithmetic mean.
root mean square deviation
Greater the deviation – greater the dispersion
Smaller the deviation- higher degree of uniformity
Calculation of S.D
For ungrouped data:
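Calculate the mean = $\bar{x}$
Difference of each observation from the mean,
$d = x_i - \bar{x}$
Square these = $d^2$
Total these = $\sum d^2$
Divide this by the number of observations minus 1,
variance = $\sum d^2 / (n-1)$
The square root of this variance is the standard deviation:
$$\text{S.D.} = \sqrt{\frac{\sum d^2}{n-1}}$$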
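The steps for ungrouped data translate line by line into a short calculation. The five observations are hypothetical, chosen so that the intermediate values are easy to follow by hand.

```python
from math import sqrt

observations = [4, 8, 6, 5, 7]  # hypothetical data

n = len(observations)
mean = sum(observations) / n                   # step 1: the mean
deviations = [x - mean for x in observations]  # step 2: d = x_i - mean
squared = [d ** 2 for d in deviations]         # step 3: d squared
sum_of_squares = sum(squared)                  # step 4: total of d squared
variance = sum_of_squares / (n - 1)            # step 5: divide by n - 1
sd = sqrt(variance)                            # step 6: square root

print(f"mean = {mean}, variance = {variance}, S.D. = {sd:.3f}")
```

The standard library function `statistics.stdev` uses the same n - 1 divisor and can serve as a cross-check.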
For grouped data: with single units for class intervals
Make a frequency table
Determine the midpoint of each range
$$\text{S.D.} = \sqrt{\frac{\sum (X_i - \bar{x})^2 f_i}{n-1}}$$
$X_i$ : individual observation in the class
$\bar{x}$ : mean
$f_i$ : frequency
$n$ : total frequency
Calculation for grouped data with range for class interval:
Class intervals in terms of range:
Frequency- -centered in mid points
$$S = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{n-1}}$$
Xi – -midpoint of class interval
x- mean
fi – frequency
n- total frequency
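A hedged sketch of the grouped-data calculation, using the midpoint of each class interval and its frequency. The frequency table below is hypothetical and only serves to show the arithmetic.

```python
import math

def grouped_sd(midpoints, frequencies):
    """SD of grouped data: each midpoint is weighted by its class frequency."""
    n = sum(frequencies)  # total frequency
    mean = sum(f * x for x, f in zip(midpoints, frequencies)) / n
    sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, frequencies))
    return math.sqrt(sum_sq / (n - 1))

# Hypothetical class intervals 0-10, 10-20, 20-30, 30-40
midpoints = [5, 15, 25, 35]
frequencies = [2, 5, 2, 1]
print(round(grouped_sd(midpoints, frequencies), 2))  # 9.19
```

Here the total frequency is 10, the mean is 17 and the weighted sum of squares is 760, so the SD is the square root of 760/9.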
Uses of standard deviation
Summarizes the deviations , of a large distribution
Indicates whether the variation from mean is by chance or real
Helps in finding standard error
Helps in finding the suitable size of sample
Standard deviation is only interpretable as a summary measure for variables having approximately symmetric distributions
Coefficient of variation
Compare relative variability
Variation of same character in two or more series
compare the variability of one character in two different groups having different magnitude of values or
to compare two characters in the same group by expressing in percentage
$$\text{C.V.} = \frac{\text{S.D.}}{\text{mean}} \times 100$$
Higher the C.V greater variability
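As an illustration, the coefficient of variation lets two characters measured in different units be compared. The heights and weights below are hypothetical values, not data from the source.

```python
import statistics

def coefficient_of_variation(values):
    """SD expressed as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical measurements on the same five people
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [50, 60, 70, 80, 90]

print(round(coefficient_of_variation(heights_cm), 2))  # 4.65
print(round(coefficient_of_variation(weights_kg), 2))  # 22.59
```

In this made-up group, weight is relatively more variable than height even though the two are measured in different units.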
Normal distribution & Normal curve
Height of bars or curve greatest in middle
Values are spread around mean
Maximum values around mean , few at extremes
half values above & half below mean
Properties of the normal Distribution
curve is bell shaped.
The curve is symmetrical about the middle point.
The mean is located at the highest point of the curve
measures of central tendency coincide.
Maximum number of observations is at the value of the variable corresponding to the mean
The number of observations gradually decreases on either side, with very few observations at the extreme points.
The area under the curve between any two points corresponds to the number of observations between those two values of the variate, and can be expressed in terms of the mean and the SD:
a) Mean ±1 S.D. covers 68.3% of the observations;
b) Mean ±2 S.D. covers 95.4% of the observations;
c) Mean ±3 S.D. covers 99.7% of the observations.
This relationship is used for fixing confidence interval.
Normal distribution law forms the basis for various tests of significance
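The percentages quoted above can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name is an assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage_within(k):
    """Proportion of a normal distribution lying within mean +/- k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * coverage_within(k), 1))
# 1 68.3
# 2 95.4
# 3 99.7
```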
Relative or standard normal deviate
Deviation from mean in normal distribution
Measured in terms of S.D
Indicates how much an observation is bigger or smaller than the mean, in units of SD
$$Z = \frac{\text{observation} - \text{mean}}{\text{SD}} = \frac{x - \bar{x}}{S}$$
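A one-line illustration of the formula, with hypothetical numbers (an observation of 172 in a distribution with mean 160 and SD 6):

```python
def standard_normal_deviate(observation, mean, sd):
    """Distance of an observation from the mean, in units of SD."""
    return (observation - mean) / sd

# Hypothetical values: observation 172, mean 160, SD 6
z = standard_normal_deviate(172, 160, 6)
print(z)  # 2.0
```

The observation lies 2 SD above the mean.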
Probability or chance
The relative frequency with which an event is expected to occur on average
Expressed as ‘p’
Ranges from 0-1
when p= 0, no chance of event happening
When p=1 , 100% chances of event happening
$$p = \frac{\text{number of events occurring}}{\text{total number of trials}}$$
Statistical hypothesis
Used to judge whether the difference between sample estimates is real or due to chance
Two hypotheses are made:
Null hypothesis or hypothesis of no difference
Alternative hypothesis of significant difference
Null hypothesis or hypothesis of no difference [Ho]
Asserts that there is no real difference in sample & general population
The difference found is accidental & arises out of sampling variations
Alternative hypothesis of significant difference [H1]
States that sample result is different than the hypothetical value of population
To minimize errors the sampling distribution or area under normal curve is divided into two regions or zones
1. Zone of acceptance :samples in the area of mean ±1.96 SE, null hypothesis – accepted
2. Zone of rejection: sample in the shaded area is beyond the mean ±1.96 SE, null hypothesis – rejected
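The two zones can be expressed as a simple check. This is a sketch under the assumption that the sampling distribution is normal and that the population mean and the standard error (SE) are known; the numbers are hypothetical.

```python
def in_zone_of_acceptance(sample_mean, population_mean, standard_error, z_crit=1.96):
    """True if the sample mean lies within population mean +/- z_crit * SE."""
    lower = population_mean - z_crit * standard_error
    upper = population_mean + z_crit * standard_error
    return lower <= sample_mean <= upper

# Hypothetical: population mean 120, SE 2, so the zone is 116.08 to 123.92
print(in_zone_of_acceptance(123, 120, 2))  # True  -> null hypothesis accepted
print(in_zone_of_acceptance(125, 120, 2))  # False -> null hypothesis rejected
```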
Degrees of freedom
The quantity in the denominator, which is one less than the number of independent observations in the sample.
Eg:
When there are 10 values, there are 9 choices or degrees of freedom
In unpaired t-test of difference between 2 means
df = n1 + n2 - 2
where n1 and n2 are the numbers of observations in the two samples.
In paired t-test, df = n - 1
Standard error
A measure of the variability of the sample mean
Obtained by dividing the SD by the square root of the sample size
$$SE = \frac{SD}{\sqrt{n}}$$
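For instance, with a hypothetical SD of 10 and a sample of 25 observations:

```python
import math

def standard_error_of_mean(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

print(standard_error_of_mean(10, 25))   # 2.0
print(standard_error_of_mean(10, 100))  # 1.0
```

Quadrupling the sample size halves the standard error.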
2 types of errors:
Type I error
Type II error
Type I error (α error):
null hypothesis is rejected when it is true
the null hypothesis is rejected even though the result falls in the zone of acceptance
serious error.
Type II error (β error):
null hypothesis is wrongly accepted when it is false
the null hypothesis is accepted even though the result falls in the zone of rejection
not a serious error; needs only confirmation of the result by changing the level of significance

Decision      Null hypothesis is true     Null hypothesis is false
Accept it     Correct decision            Type II error
Reject it     Type I error                Correct decision
Tests of significance
Parametric and non parametric tests or methods
Parametric methods
Methods of statistical inference that are based on the assumption that the population has a certain probability distribution are referred to as parametric methods. For example, the t-distribution and F-distribution are associated with the values of parameters of an assumed normal probability distribution.
Non parametric methods
Statistical procedures that do not require any assumption about the form of the probability distribution from which the observations come are known as non parametric methods. These are also called distribution free methods. For example, chi square frequency techniques are non parametric.
Parametric tests
Eg. t test, Z test, Chi-square test (when testing population variance), Pearson correlation coefficient
Non parametric tests
Eg. Chi-square test, Kruskal-Wallis test, Spearman correlation coefficient
Tests of significance- Steps involved
Define the problem
state the hypothesis
Null hypothesis
Alternate hypothesis
Fix the level of significance
Select appropriate test to find test statistic
Find degree of freedom (df)
Compare the observed test statistic with theoretical one at desired level of significance & corresponding DF
If the observed test statistic value is greater than the theoretical value, reject the null hypothesis.
Draw the inference based on the level of significance
Objective of using tests of significance
To compare – sample mean with population
Means of two samples
Sample proportion with population
Proportion of two samples
Association b/w two attributes
t - test
Student’s t-test
Designed by W.S. Gosset
Unpaired t- test (two independent samples)
Paired t- test ( single sample correlated observation)
Essential conditions:
randomly selected samples from the corresponding populations
Homogeneity of variances in the 2 samples
Quantitative data
Variable normally distributed
Sample size < 30
Unpaired t- test
Applied to unpaired data: independent observations made on the individuals of two different or separate groups or samples drawn from 2 populations
Null hypothesis is stated
Find the difference between the means of the two samples
$(\bar{X}_1 - \bar{X}_2)$ measures the variation in the variable
Calculate the t value
$$t = \frac{\bar{X}_1 - \bar{X}_2}{SE}$$
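A minimal sketch of the unpaired t calculation. The source does not spell out how the SE of the difference is obtained; the pooled-variance form used below is an assumption, chosen because it matches the stated condition of homogeneous variances and df = n1 + n2 - 2. The two groups are hypothetical.

```python
import math
import statistics

def unpaired_t(sample1, sample2):
    """Return (t, df) for two independent samples using a pooled variance."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    # Pooled variance (assumed form): weights each variance by its df.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, n1 + n2 - 2

# Hypothetical measurements in two separate groups
group_a = [10, 12, 14, 16, 18]
group_b = [13, 15, 17, 19, 21]
t_value, df = unpaired_t(group_a, group_b)
print(round(t_value, 2), df)  # -1.5 8
```

The calculated t is then compared, ignoring its sign, with the table value at the chosen level of significance and 8 degrees of freedom.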
Paired t- test
To study the role of a factor or cause when the observations are made before and after it comes into play
Eg: exertion on pulse rate, effect of a drug on blood pressure etc
To compare the effect of 2 drugs given to the same individual in the sample on two different occasions
eg: adrenaline & noradrenaline on pulse rate
To study the comparative accuracy of 2 different instruments
eg: 2 different types of sphygmomanometers
To compare the results of 2 different lab techniques
To compare the observations made at two different sites in the same body
Testing procedure:
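Null hypothesis
Find the difference in each pair of observations, $X_1 - X_2 = x$
Calculate the mean of the differences, $\bar{x} = \sum x / n$
Calculate the SD of the differences and the SE of the mean,
$SE = SD / \sqrt{n}$
Determine the t value
$$t = \frac{\bar{x} - 0}{SD / \sqrt{n}}$$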
• Find the degrees of freedom , n-1
• refer the table & find the probability
• P >0.05 not significant
• P< 0.05 significant
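The testing procedure above can be sketched as follows. The before and after pulse rates are hypothetical, and the probability lookup in the t table is left to the reader.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t, df) for paired observations on the same individuals."""
    differences = [a - b for b, a in zip(before, after)]
    n = len(differences)
    mean_diff = statistics.mean(differences)   # mean of the differences
    sd_diff = statistics.stdev(differences)    # SD of the differences
    se = sd_diff / math.sqrt(n)                # SE of the mean difference
    return (mean_diff - 0) / se, n - 1         # null hypothesis: mean difference is 0

# Hypothetical pulse rates before and after exertion in five people
before = [72, 75, 70, 78, 74]
after = [76, 78, 75, 80, 79]
t_value, df = paired_t(before, after)
print(round(t_value, 2), df)
```

The differences here are 4, 3, 5, 2 and 5, with mean 3.8 and 4 degrees of freedom.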
Variance ratio test or F test
Variance: a measure of the extent of the variation present in a set of data
Obtained by dividing the sum of squares of the deviations by the degrees of freedom
Measured in squared units
Comparison of variance b/w two samples
Test developed by Fisher & Snedecor
Involves another distribution called F – distribution
Calculate the variances of the two samples first, $S_1^2$ and $S_2^2$
(Variance = SD²)
$$F = \frac{S_1^2}{S_2^2}, \qquad S_1^2 > S_2^2$$
The larger variance, $S_1^2$, is the numerator
Significance of F by referring to F- table
Degrees of freedom , (n1 – 1 ) & (n2 – 1) in the two samples
Table gives variance ratio values at different levels of significance, with df (n1 – 1) given horizontally and df (n2 – 1) vertically
Eg: Sample A: sum of squares = 36; df = 8
Sample B: sum of squares = 42; df = 9
F = (42/9) / (36/8) = (42/9) x (8/36) = 1.04
This value of F < table value at p =0.05, not significant
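The worked example can be reproduced in a few lines. The function name is an assumption, and the comparison against the F table is not included.

```python
def variance_ratio(ss_a, df_a, ss_b, df_b):
    """F ratio from two sums of squares, with the larger variance on top."""
    var_a = ss_a / df_a
    var_b = ss_b / df_b
    return max(var_a, var_b) / min(var_a, var_b)

# Sample A: sum of squares 36, df 8; Sample B: sum of squares 42, df 9
f_value = variance_ratio(36, 8, 42, 9)
print(round(f_value, 2))  # 1.04
```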
Analysis of variance(ANOVA) test
Compare more than two samples
Compares variation between the classes as well as within the classes
For such comparisons there is high chance of error using t or Z test
Variation in experimental studies – natural variation/ random / error variation
Variation caused due to experimenter- imposed variation or treatment variation
A :b/w groups variation = random variation (always) + imposed variation (maybe)
B :Within group variation = random variation
Total variation = A+B
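If there is no real difference between groups, then
$$\frac{\text{between treatment}}{\text{within treatment}} = \frac{\text{random variation}}{\text{random variation}} \approx 1$$
If there is any real difference between the treatments, then
$$\frac{\text{between treatment}}{\text{within treatment}} = \frac{\text{random variation} + \text{imposed variation}}{\text{random variation}} > 1$$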
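A sketch of the ratio of between-group to within-group variation for a one-way layout. It assumes each variation is expressed as a mean square (sum of squares divided by its degrees of freedom), which is the usual ANOVA convention but is not stated in the source. The three groups are hypothetical.

```python
import statistics

def one_way_anova_f(groups):
    """F = between-group mean square / within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group variation: group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: observations around their own group mean.
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical responses under three treatments
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
print(one_way_anova_f(groups))  # 12.0
```

A ratio well above 1, as here, suggests imposed (treatment) variation on top of random variation; its significance is judged from the F table.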
Chi square test ( χ ² test )
Non parametric test
Developed by Karl Pearson
Not based on normal distribution of any variable
Used for qualitative data
To test whether the difference in distribution of attributes in different groups is due to sampling variation or otherwise.
Applications
1. Test for goodness of fit
2. Test of association (independence)
3. Test of homogeneity or population variance
χ² test is non parametric in the first two cases and parametric in the third case
Calculation of χ² value
Three requirements –
A random sample
Qualitative data
Lowest expected frequency not less than 5
$$\chi^2 = \sum \frac{(\text{observed } f - \text{expected } f)^2}{\text{expected } f}$$
Expected f = row total x column total / grand total
df = (r - 1) x (c - 1)
The calculated value is compared with the table value
Drawbacks :
Tells us about the association but fails to measure the strength of association.
Test is unreliable if the expected frequency in any one cell is less than 5.
Correction is done by subtracting 0.5 from |O - E| (Yates's correction)
For tables larger than 2 x 2, Yates's correction cannot be applied
Not applicable when there is 0 or 1 in any of the cells [ Resort to Fisher’s exact probability test ]
χ² values interpreted with caution when sample < 50
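A minimal sketch of the χ² calculation for a contingency table, with Yates's correction as an option restricted to 2 x 2 tables. The observed counts are hypothetical, and the function does not check the expected-frequency requirement.

```python
def chi_square(observed, yates=False):
    """Return (chi-square value, df) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    is_2x2 = len(observed) == 2 and len(observed[0]) == 2
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            diff = abs(obs - expected)
            if yates and is_2x2:
                # Yates's correction: subtract 0.5 from |O - E|, not below zero.
                diff = max(diff - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Hypothetical 2 x 2 table: rows are groups, columns are outcome present/absent
table = [[20, 30], [30, 20]]
print(chi_square(table))              # (4.0, 1)
print(chi_square(table, yates=True))  # chi-square is smaller after correction
```

Every expected frequency in this table is 25, so the uncorrected value is 4 x (5² / 25) = 4.0 with 1 degree of freedom.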
Non parametric tests:
A family of statistical tests, also called distribution free tests, that do not require any assumption about the distribution the data set follows and that do not require the testing of distribution parameters such as means or variances.
Friedman’s test – nonparametric equivalent of analysis of variance
Kruskal-Wallis test – compares medians of several independent samples; equivalent of one-way analysis of variance
Mann-Whitney U test – compares medians of two independent samples; equivalent of the t test
McNemar’s test – variant of the chi-squared test, used when data are paired
Wilcoxon’s signed rank test – paired data
Spearman’s rank correlation – correlation coefficient
CONCLUSION
It is more important to understand the indications and limitations of the various statistical tests than the detailed mathematical calculations, since the latter are taken care of by software such as SPSS
Understanding the classification of data is crucial for the selection of appropriate test of significance
REFERENCES
B.K. Mahajan. Methods in Biostatistics, 6th edition, Jaypee brothers
P.S.S.Sundar Rao, J.Richard. An introduction to Biostatistics,3rd edition, Prentice Hall of India.
James F Jekel, David L Katz, Joann G Elmore. Epidemiology, biostatistics and preventive medicine, 2nd edition, WB Saunders Company
C.R. Kothari. Research Methodology.
Soben Peter. Preventive and Community Dentistry, 4th edition.