BIOSTATISTICS

Statistics deals with

• Planning Research

• Collecting Data

• Describing Data

• Summarizing- Presenting Data

• Analyzing Data

• Interpreting Results

• Reaching decisions or discovering new knowledge

Biostatistics is the application of statistical methods to the health sciences.

DEFINITION:

It is the method of collecting, organizing, analysing, tabulating and interpreting data related to living organisms and human beings. [Soben Peter]

History

John Graunt (1620-1674), who was neither a physician nor a mathematician, is regarded as the father of health statistics. The term biometry was coined by W.F.R. Weldon (1860-1906), a zoologist at University College, London.

Use of biostatistics

• To test whether the difference between two populations is real or a chance occurrence.

• To study the correlation between attributes in the same population.

• To evaluate the effect of vaccines, sera, etc.

• To measure mortality and morbidity.

• To evaluate achievements of public health programs.

• To fix priorities in public health programs.

• To help promote health legislation and create administrative standards for oral health.

Aims of biostatistics

• To generate statistical data through experimental investigations and sample surveys.

• To organize and represent the data in suitable tables, diagrams, charts or graphs, etc.

• To draw valid inferences from the data collected, put forth definite interpretations, or predict future outcomes from the data.

Why should medical/Dental students learn biostatistics?

1. Medicine is becoming increasingly QUANTITATIVE.

• The aim is to improve the Health Status of the population.

• We have to clarify the relationships between certain factors and diseases.

• Enumerate the occurrences of diseases

• Explain the etiology of diseases (which factors cause which diseases)

• Predict the number of disease occurrences

• Read, understand and criticize the medical literature.

2. The planning, conduct and interpretation of much of medical research are becoming increasingly reliant on statistical methods.

Planning:

How many patients must be treated?

How should the subjects be allocated to treatments?

What other factors may influence the response variable?

Conduct:

Under which conditions must the study be conducted?

Is matching necessary?

Is blinding (single blinding or double blinding) necessary?

Is there a need for a control group?

Should the placebo effect be considered?

Which experimental design technique is more appropriate?

Interpretation:

Example: Distribution of Women with a Diagnosis of Thromboembolism Among Blood Groups

Blood Group    Frequency    %
A              32           58.2
AB             4            7.3
B              8            14.5
O              11           20.0
Total          55           100.0
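The percentage column of the blood group table can be reproduced from the frequencies alone. The following is a minimal sketch; the variable names are illustrative and not part of the original notes.

```python
# Frequencies taken from the blood group table.
frequencies = {"A": 32, "AB": 4, "B": 8, "O": 11}

total = sum(frequencies.values())

# Percentage of each blood group, rounded to one decimal place as in the table.
percentages = {
    group: round(100 * count / total, 1) for group, count in frequencies.items()
}

for group, count in frequencies.items():
    print(f"{group:<6}{count:>4}{percentages[group]:>8.1f}")
print(f"{'Total':<6}{total:>4}{100.0:>8.1f}")
```

Interpreting such a table still requires a comparison, for example against the blood group distribution in women without the diagnosis; the percentages by themselves only describe the sample.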

Terminologies:

Data : Set of values of one or more variables recorded on one or more observational units.

Observation (case): Individual source of data.

Variable: This is a quantity which varies such that it may take any one of a specified set of values. It may be measurable or non-measurable.

Population: A collection, or set, of individuals, objects, or measurements whose properties are to be analyzed.

Sample: A subset of the population, selected in such a way that it is representative of the larger population.

Parameter : A summary value which in some way characterizes the nature of the population in the variable under study.

Statistic: A summary value calculated from a sample of observations.

DATA:

Sources of data

1. Routinely kept records

2. Published data sources

3. Data on electronic media

4. Surveys and Experimental research

5. Census

6. Generated or artificial data

Types of Data

1. Qualitative Data

Results from a variable that asks for a quality type of description of the subject.

2. Quantitative Data

Results from obtaining quantities-counts or measurements.

Types of biostatistics

• Descriptive biostatistics

• Inferential biostatistics

Descriptive biostatistics: It is the study of biostatistical procedures which deal with the collection, representation, calculation and processing, i.e., the summarization of the data to make it more informative and comprehensible. The primary function of descriptive statistics is to provide meaningful and convenient techniques for describing features of data that are of interest. Failure to choose appropriate descriptive statistics often leads to faulty scientific inference. The field of descriptive statistics is not concerned with the implications or conclusions that can be drawn from sets of data.

Inferential biostatistics: It comprises the procedures used to generalize or draw conclusions on the basis of studies of a sample. This is also known as sampling biostatistics. The study of the quantitative aspects of the inferential process provides a solid basis on which the more general substantive process of inference can be founded.

Basis for statistical analyses

Statistical analyses are based on three primary entities:

• The population 'U' that is of interest

• The set of characteristics (variables) of units of this population 'V'

• The probability distribution 'P' of these characteristics in the population

The population 'U'

The population is a collection of units of observation that are of interest and is the target of the investigation. For example, in determining the effectiveness of a particular drug for a disease, the population would consist of all possible patients with the disease. It is essential, in any research study, to identify the population clearly and precisely. The success of the investigation will depend to a large extent on the identification of the population of interest.

The variables 'V'

A variable is a state, condition, concept or event whose value is free to vary within the population. Once the population is identified, we should clearly define which characteristics of the units of this population (subjects of the study) we are planning to investigate.

For example, in the case of a particular drug, one needs to define the disease and what other characteristics of the people (e.g. age, sex, education, etc.) one intends to study.

Clear and precise definitions and methods for measuring these characteristics (a simple observation, a laboratory measurement, or tests using a questionnaire) are essential for the success of the research study.

Variables can be classified as:

Independent variables: variables that are manipulated or treated in a study in order to see what effect differences in them will have on those variables proposed as being dependent on them. Synonyms: cause, input, predisposing factor, antecedent, risk factor, characteristic, attribute, determinant.

Dependent variables: variables in which changes are results of the level or amount of the independent variable or variables. Synonyms: effect, outcome, consequence, result, condition, disease.

Confounding or intervening variables: variables that should be studied because they may influence or 'confound' the effect of the independent variables on the dependent variables. E.g., in a study of the effect of tobacco (independent variable) on oral cancer (dependent variable), the nutritional status of the individual may play an intervening role.

Background variables: variables that are so often of relevance in investigations of groups or populations that they should be considered for possible inclusion in the study. Examples: sex, age, ethnic origin, education, marital status, social status.

The probability distribution 'P'

The probability distribution is a way to enumerate the different values the variable can have, and how frequently each value appears in the population. The actual frequency distribution is approximated by a theoretical curve that is used as the probability distribution. Common examples of probability distributions are the binomial and the normal.

For example, the incidence of a relatively common illness may be approximated by a binomial distribution, whereas the distributions of continuous variables (blood pressure, heart rate) are often considered to be normally distributed.

Probability distributions are characterized by parameters, i.e., quantities that allow us to calculate the probabilities of various events concerning the variable, or that allow us to determine the value of probability for a particular value. The binomial distribution has two parameters: the number of subjects observed (n) and the probability (p). It occurs when a fixed number of subjects are observed, the characteristic is dichotomous in nature (only two possible values), and each subject has the same probability (p) of having one value and (1-p) of having the other value.

The normal distribution, on the other hand, is a mathematical curve characterized by two quantities, μ and σ. The former represents the mean of the values of the variable, and the latter the standard deviation. The type of statistical analysis done depends on the design of the study.
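To see how the parameters determine probabilities, the sketch below evaluates a binomial probability and a normal cumulative probability using only the Python standard library. The numbers (10 subjects, p = 0.2, mean 120, SD 10) are hypothetical and chosen purely for illustration.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability that exactly k of n subjects have the characteristic."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Probability that a normal variable with mean mu and SD sigma is <= x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


# Hypothetical: 10 subjects, each with probability 0.2 of having the illness.
p_two_cases = binomial_pmf(2, n=10, p=0.2)

# Hypothetical: systolic blood pressure with mean 120 and SD 10 (mmHg).
p_below_130 = normal_cdf(130, mu=120, sigma=10)

print(round(p_two_cases, 4), round(p_below_130, 4))
```

Changing n and p, or the mean and SD, changes every probability computed from the distribution, which is what is meant by saying the distribution is characterized by its parameters.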

Collection of data

In scientific research work, data is collected only from personal experimental study, i.e., primary data is used. Statistical data can be collected in two ways:

1. Census method

2. Sampling method

Census method

In this method the data is collected from all the individual items that are connected with the inquiry.

Advantages of census method

i. The data has a high degree of accuracy.

ii. The data is more representative and true.

iii. Results are more reliable.

iv. Possibility of bias is minimised.

Disadvantages of census method

i. It is less economical as it consumes more time, more energy and more expenditure.

ii. It requires organizational skills and a large number of investigators.

iii. It cannot be applied to all situations, e.g., to determine the blood cell count it is not possible to analyse the whole blood.

Sampling method

In this method the data is collected from a small group of the population, which is termed a sample. A sample is a portion of the population selected to represent the population.

Types of samples

There are two types of samples which are used in biostatistics:

1. Qualitative samples: when we say that children from an African population are taller than those in India, it is called a qualitative sample.

2. Quantitative samples: when we try to know the number of decayed teeth of individuals of a particular age group, it is called a quantitative sample.

Size of samples

The total number of units used in the study to get significant results is termed the sample size. Selecting the proper sample size is very important. The sample size should be neither very small nor very large, because the conclusions are directly affected by it.

Advantages of the sampling method

• This method is comparatively more economical as it consumes less energy, less time and less expenditure.

• It requires fewer investigations.

• It is most suited to those places and situations where the census method cannot be applied.

Disadvantages of the sampling method

• It requires the services of experts; otherwise incorrect or misleading results will be obtained.

• In this method, selection of an appropriate method of sampling is necessary.

If the population is very small and we need precise information, then the census method is preferred. If the population is very large or the field of investigation is very wide and quick results are required, sampling methods should be used.

Types of sampling methods

There are two types of sampling techniques:

a. Random or probability sampling

1. Simple randomized sampling

2. Stratified randomized sampling

3. Systematic sampling

4. Cluster sampling

5. Multistage sampling

b. Nonrandom or nonprobability sampling

1. Convenience sampling

2. Purposive sampling

3. Quota sampling

Random or probability sampling

In random sampling a sample is selected in such a way that every element in the population has an equal opportunity of being included in the sample. It means random sampling is made without deliberate discrimination. Random sampling is carried out to ascertain a particular character of the population. It involves unbiased or non-preferential samples.

Selection of random samples

Sampling without replacement

In this type of sampling an observation is included only once and is selected randomly without any preference or conscious effort.

Sampling with replacement

In this type of sampling the observation has a chance to be selected at each draw.
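The difference between the two selection schemes can be illustrated with Python's standard `random` module. This is only a sketch: the population of 20 numbered units and the fixed seed are assumptions made for the example.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = list(range(1, 21))  # hypothetical population of 20 numbered units

# Without replacement: each unit can appear at most once in the sample.
without_replacement = rng.sample(population, k=5)

# With replacement: every unit is available again at each draw,
# so the same unit may be selected more than once.
with_replacement = rng.choices(population, k=5)

print(without_replacement)
print(with_replacement)
```

`random.sample` never repeats a unit, whereas `random.choices` draws independently each time, which mirrors the two definitions given above.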

Properties of random samples:

• Several samples drawn from the same population will differ, i.e., their statistical characteristics will change from sample to sample.

• A random sample should be large, because the larger the sample, the smaller the variation of its characteristics from one random sample to another.

• A random sample must be selected in such a way that every element in the population has an equal opportunity of being included in the sample.

Advantages of random sampling

The main advantages of random sampling are:

1. Random sampling enables the researcher to draw inferences about the whole population.

2. It eliminates personal bias. The researcher cannot reject those observations which do not support his theory. Similarly, the researcher cannot select only those observations which may support his theory.

Types of random sampling methods

1. Simple randomized sampling

2. Stratified randomized sampling

3. Systematic sampling

4. Cluster sampling

5. Multistage sampling

Simple random sampling

In this method samples are chosen at random and each member or sample unit of the population has an equal chance of being selected in the sample. This method is well suited when the population is small, homogeneous and readily available. This method is sometimes called unrestricted random sampling.

• Every possible sample of a certain size within a population has a
  o known probability of being chosen
  o equal probability of being chosen

• Most basic type of probability sampling.

• Actual selection is done by randomly picking the desired number of units from the population.

• Statistically equivalent to
  o identifying all possible samples of the desired size
  o picking one of those samples at random

Stratified random sampling

Samples are chosen at random from different strata, usually of different sizes, of a population and are based on a priori information about the variation and site. Stratified random sampling is done in heterogeneous populations, i.e., this procedure is followed when the population is not homogeneous. A heterogeneous population is divided into several more or less homogeneous sections or groups. These are called strata. A sample is drawn from each stratum by simple random sampling. Thus the variability in each stratum is adequately represented in the sample also.

• Probability sampling procedure

• The chosen sample is forced to contain units from each of the segments, or strata, of the population

• Also known as proportional or quota random sampling; it involves
  o dividing the population into homogeneous subgroups
  o taking a simple random sample in each subgroup

• Statistically more efficient

• Provides more accurate population estimates.

• Two types of stratified random sampling
  o proportionate
  o disproportionate
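A proportionate stratified sample can be sketched as follows. The strata (age groups), their sizes and the sample size of 10 are hypothetical; the sizes were chosen so that the proportional allocation gives whole numbers.

```python
import random

rng = random.Random(0)

# Hypothetical heterogeneous population divided into strata (age groups).
strata = {
    "children": [f"C{i}" for i in range(50)],
    "adults": [f"A{i}" for i in range(30)],
    "elderly": [f"E{i}" for i in range(20)],
}

population_size = sum(len(units) for units in strata.values())
sample_size = 10

# Proportionate allocation: each stratum contributes in proportion to its size,
# and units within a stratum are chosen by simple random sampling.
stratified_sample = {}
for name, units in strata.items():
    n_stratum = round(sample_size * len(units) / population_size)
    stratified_sample[name] = rng.sample(units, n_stratum)

for name, units in stratified_sample.items():
    print(name, units)
```

With other stratum sizes the rounded allocations may not add up exactly to the desired sample size, so a real design would need a rule for distributing the remainder. A disproportionate design would simply replace the allocation line with fixed numbers per stratum.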

Systematic sampling

This is a simple procedure, used when a complete list of the population from which a sample is to be drawn is available. It is more often applied to field studies when the population is large, scattered and heterogeneous. In this sampling method, samples are drawn at evenly spaced intervals after a random starting position is chosen. From a large population, every 10th, 20th, 25th or 50th item is selected.

Cluster sampling

In this method the population is divided into separate natural groups of elements. These groups are called clusters. Each cluster includes only one type of elements. A simple random sample of clusters is selected, and all or some of the units in the chosen clusters are studied. A cluster may consist of units such as villages, wards, blocks, factories, slums of a town, children of a school, etc.

Generally the clusters are natural groupings and if they are geographic regions, the sampling is called as 'area sampling'.

• Probability sampling procedure

• Clusters of population units are selected at random

• All or some units in the chosen clusters are studied.

• When an adequate sampling frame of individual population units is not readily available, cluster sampling is helpful.

• Even when such a sampling frame is available, if the frame can be conveniently divided into a series of representative clusters, a cluster sampling approach may be easier to use than a simple or stratified random-sampling approach.

In cluster sampling, we follow these steps:

• divide the population into clusters (usually along geographic boundaries)

• randomly sample clusters

• measure all units within the sampled clusters
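The three steps of cluster sampling can be sketched in a few lines. The villages and their units are hypothetical, and the choice of two clusters is arbitrary.

```python
import random

rng = random.Random(1)

# Step 1: hypothetical population already divided into natural clusters (villages).
clusters = {
    "village_A": ["A1", "A2", "A3"],
    "village_B": ["B1", "B2"],
    "village_C": ["C1", "C2", "C3", "C4"],
    "village_D": ["D1", "D2", "D3"],
}

# Step 2: randomly sample whole clusters, not individual units.
chosen_clusters = rng.sample(sorted(clusters), k=2)

# Step 3: measure every unit inside the chosen clusters.
cluster_sample = [unit for name in chosen_clusters for unit in clusters[name]]

print(chosen_clusters)
print(cluster_sample)
```

Note that the randomness applies only to the choice of clusters; no sampling frame of individual units is needed, which is the practical attraction of the method.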

Multistage sampling

In multistage sampling, clusters or segments are first selected as a primary cluster sample, and these selected clusters are then sampled again instead of being fully inspected. This procedure is employed in large-scale surveys covering a whole country.

Systematic sampling

• The researcher selects the first unit randomly

• The remaining units are selected systematically

• Number the units in the population from 1 to N

• Decide on the n (sample size) that you want or need

• k = N/n = the interval size

• Randomly select an integer between 1 and k

• Take every kth unit

• Simplicity relative to the other methods

• Requires only one random number to select a sample.

• Statistical efficiency is practically equivalent
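The systematic sampling steps (number the units, compute k = N/n, pick a random start between 1 and k, take every kth unit) can be written as a short function. This is a sketch that assumes N is a multiple of n; the function name and the population of 100 units are illustrative.

```python
import random


def systematic_sample(units, n, rng):
    """Take every k-th unit after a random start, with k = N // n."""
    N = len(units)
    k = N // n                 # interval size
    start = rng.randint(1, k)  # random integer between 1 and k (inclusive)
    # Units are numbered from 1, so position p corresponds to index p - 1.
    return [units[p - 1] for p in range(start, N + 1, k)][:n]


rng = random.Random(7)
population = list(range(1, 101))  # hypothetical list of N = 100 numbered units
sample = systematic_sample(population, n=10, rng=rng)

print(sample)
```

Only one random number (the start) is drawn; every other selected unit follows from it, which is why the method is so simple to carry out in the field.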

Practical considerations: probability sampling methods

• Not all methods may be equally practical in any project.

• Base the choice upon
  o the nature of the population
  o the degree of precision desired
  o the resources available for research

Nonrandom or nonprobability sampling

• Subjective procedure

• Probability of selection for the population units cannot be determined.

• The selection is not done on a strictly chance basis

• Offers researchers greater freedom and flexibility in sampling.

• Nonprobability samples
  o cannot depend upon the rationale of probability theory
  o may or may not represent the population well
  o make it difficult to know how well we've done so

• There may be circumstances where random sampling is not
  o feasible
  o practical
  o theoretically sensible

In nonrandom sampling, the samples are drawn without following any criteria or yardstick. The sample collected does not follow any specific approach, nor can it be used to properly assess the accuracy of the estimator. In this sampling procedure many investigator biases are likely to occur. The main types are:

1. Accidental, haphazard or convenience sampling: this is also known as accidental, accessibility or haphazard sampling. The major reason is administrative convenience. The sample is chosen with ease of access being the sole concern.

• A researcher's convenience forms the basis for selecting a sample of units

• Very popular in online research, where it is known as intercept sampling or pop-up surveys.

• Traditional "man on the street" interviews conducted frequently by television news programs

• Use of college students in much psychological research is primarily a matter of convenience.

• Many research projects simply ask for volunteers.

2. Judgement/purposive sampling: this is also known as judgemental sampling. The experimenter exercises deliberate subjective choice in drawing the representative sample. Judgemental sampling aims at eliminating anticipated sources of distortion, but there will always remain the risk of distortion due to personal prejudices or lack of knowledge of certain crucial features in the structure of the population.

• The researcher exerts some effort in selecting a sample that is believed to be most appropriate.
  o The researcher will usually be knowledgeable about the nature of the ideal population.
  o Requires greater researcher effort
  o Generally more appropriate than a convenience sample.

• Can be very useful
  o when you need to reach a targeted sample quickly
  o when sampling for proportionality is not the primary concern.

• Likely to yield opinions of your target population

• Likely to overweight more readily accessible subgroups.

3. Quota sampling: this combines convenience and judgement and is more structured than either of the two. Quota sampling needs a proper statistical design to determine what numbers are needed in each of the quotas.

• Sampling a quota of units selected from each population cell, based on judgment

• Most refined form of nonprobability sampling

• Often used in practice, especially in personal interviewing.

• Resembles stratified random sampling

• Has features of judgment and convenience sampling as well.

• People are selected nonrandomly according to some fixed quota.

Proportional quota sampling

• Represent the major characteristics of the population

• Sample a proportional amount of each.

• Example
  o Population has 40% women and 60% men
  o Required sample size of 100
  o Continue sampling until achieving the percentages
  o Then stop.
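The stopping rule in the 40% women / 60% men example can be sketched as a loop that accepts people only while their quota is still open. The random generator below merely stands in for "whoever happens to come along next"; in real quota sampling the selection is nonrandom, so this is an illustration of the bookkeeping rather than of the fieldwork.

```python
import random

rng = random.Random(3)

# Quotas from the example: a sample of 100 with 40% women and 60% men.
quotas = {"woman": 40, "man": 60}
counts = {"woman": 0, "man": 0}
sample = []

# Keep going until every quota has been filled, then stop.
while any(counts[group] < quotas[group] for group in quotas):
    person = rng.choice(["woman", "man"])  # hypothetical next person encountered
    if counts[person] < quotas[person]:    # accept only while the quota is open
        counts[person] += 1
        sample.append(person)

print(counts)
```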

Non-proportional quota sampling

• Specify the minimum number of sampled units you want in each category.

• Not concerned with numbers that match the proportions in the population.

• Simply need enough to assure the ability to talk about even small groups in the population.

• Nonprobabilistic analogue of stratified random sampling

• Typically used to assure that smaller groups are adequately represented

4. Heterogeneity sampling, also known as sampling for diversity

• To include all opinions or views

• Not concerned about representing these views proportionately.

• Obtain a broad spectrum of ideas

• Not identifying the "average" or "modal instance" ones.

• Sampling ideas, not people.

• To get all of the ideas (especially the unusual ones)

• Include a broad and diverse range of participants.

5. Snowball sampling

• Identify someone who meets the criteria for inclusion

• Ask them to recommend others they may know who also meet the criteria.

• Useful when trying to reach populations that are inaccessible or hard to find.

Sampling error

• The difference between a statistic value generated through sampling and the parameter value, which can be determined only through a census study

• The magnitude of the sampling error indicates how precisely the population parameter can be estimated from a sample value

• Estimate the average amount of sampling error associated with a given sampling procedure.

• The true population parameter value is unknown

• The sample statistic value may vary from sample to sample within the population
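A small simulation makes the idea of sampling error concrete. Here the whole population is known (which is never the case in practice), so the parameter can be computed and compared with the statistic from a sample. The population of 200 individuals and the sample size of 20 are hypothetical.

```python
import random
import statistics

rng = random.Random(11)

# Hypothetical population: number of decayed teeth in 200 individuals.
population = [rng.randint(0, 8) for _ in range(200)]
parameter = statistics.mean(population)  # obtainable only through a census

# Statistic computed from one random sample of 20 individuals.
sample = rng.sample(population, k=20)
statistic = statistics.mean(sample)

sampling_error = statistic - parameter

# Repeating the sampling shows how the statistic varies from sample to sample
# and gives an average size of the sampling error for this procedure.
errors = [
    statistics.mean(rng.sample(population, k=20)) - parameter for _ in range(1000)
]
average_absolute_error = statistics.mean(abs(e) for e in errors)

print(round(sampling_error, 3), round(average_absolute_error, 3))
```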

PRESENTATION OF DATA

Objective of classification of data:

make the data simple,

concise, meaningful,

interesting and

helpful in further analysis.

two main methods of presenting data:

Tabulation and

Diagrams

TABULATION

classified on the following bases:

Geographical, i.e., area-wise, e.g. cities, districts, etc.

Chronological, i.e., on the basis of time.

Qualitative, i.e., according to some attribute.

Quantitative, i.e., in terms of magnitude.

The two elements of classification are

The variable and

The frequency.

Variable: a name denoting a condition, occurrence or effect that can assume different values.

The values of a variable are divided into subgroups or classes, each of which has a lowest and a highest value.

Class interval: difference between the upper and lower limit of a class

E.g.: in the class 5 - 14,

5 is the lower limit and 14 is the upper limit.

Class interval = 14 - 5 = 9.

Frequency: is the number of units belonging to each group of the variable.

Frequency distribution table: way of presenting data in the tables

Frequency distribution table

• Title of the table – named at the bottom

• The number of class intervals - between 5 and 20; there is no rigidity about it.

• The class intervals - of equal width.

• Clearly defined class limits – to avoid ambiguity.

For example: 0-4, 5-9, 10-14, etc.

• Clearly defined row and column with the headings

• Units of measurement should be specified.

• If the data is not original, the source of the data should be mentioned at the bottom of the table.
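A frequency distribution table following these rules (equal-width classes with clearly defined, non-overlapping limits) can be built with a simple count. The raw observations below are hypothetical; the class limits are the ones given in the example.

```python
# Hypothetical raw data: number of decayed teeth recorded for 15 individuals.
observations = [0, 3, 7, 12, 5, 9, 1, 14, 6, 2, 10, 4, 8, 11, 3]

# Clearly defined, equal-width class limits (both limits inclusive).
class_limits = [(0, 4), (5, 9), (10, 14)]

# Frequency: the number of units belonging to each class of the variable.
frequency_table = {
    f"{low}-{high}": sum(low <= x <= high for x in observations)
    for low, high in class_limits
}

print("Class   Frequency")
for label, frequency in frequency_table.items():
    print(f"{label:<8}{frequency}")
print(f"{'Total':<8}{sum(frequency_table.values())}")
```

Because the limits do not overlap, every observation falls into exactly one class, so the frequencies add up to the number of observations.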

Diagrams:

Extremely useful

attractive to the eyes,

give a bird's eye view of the entire data,

have a lasting impression

TYPES OF DIAGRAMS:

Bar Diagram : qualitative data.

Multiple Bar: qualitative data

Component Bar Diagram: qualitative data.

Proportional Bar Diagram

Histogram: quantitative data of continuous type.

Frequency Polygon: quantitative data

Pie Diagram: qualitative data

Line diagram: qualitative data

Cartograms or Spot Map: geographical distribution of frequencies

Basic rules :

Self explanatory

Simple and consistent with the data.

Values of the variables - on horizontal or X-axis and the frequency - vertical line or Y-axis.

Not too many lines on the graph; it should not look clumsy.

The scale of presentation – right hand top corner of the graph.

The scale of division of the two axes should be proportional.

The details of the variables and frequencies presented on the axes.

Bar Diagram

Represent qualitative data.

Only one variable.

width of the bar remains the same

The length varies according to the frequency in each category.

Bars: vertical or horizontal.

Limitation:

represent only one classification

cannot be used for comparison

Facilitate comparison of data relating to different time periods and regions.

Multiple Bar:

compare qualitative data with respect to a single variable.

Eg: sex wise or with respect to time or region.

Each category of the variable has a set of bars of the same width, corresponding to the different sections, without any gap between the bars of a set; the length corresponds to the frequency.

Component Bar Diagram:

represent qualitative data.

Shows both the number of cases in the major groups and in the subgroups simultaneously.

A rectangle is drawn for the number of cases in each major group.

Each rectangle is divided according to the number in the subgroups.

Proportional Bar Diagram:

represent qualitative data.

When only the proportions of sub-groups are to be compared between different major groups of observations, bars are drawn for each group with the same length, either as 1 or 100%. These are then divided according to the sub-group proportion in each major group.

PIE DIAGRAM

The frequency of the group is shown in a circle.

Degree of angle denotes the frequency.

Instead of comparing the length of bar , the areas of segments are compared.

Line diagram:

useful to study changes of values in the variable over time

simplest type

X-axis, - hours, days, weeks, months or years

Y-axis- value of any quantity pertaining to X-axis,

Histogram

quantitative data of continuous type.

bar diagram without gap between the bars.

represents a frequency distribution.

X-axis: the size of an observation is marked. Starting from 0 the limit of each class interval is marked, the width corresponding to the width of the class interval in the frequency distribution.

Y-axis :the frequencies are marked. A rectangle is drawn above each class interval with height proportional to the frequency of that interval.

Frequency Polygon

frequency distribution of quantitative data

compare two or more frequency distributions.

a point is marked over the mid-point of the class interval, corresponding to the frequency.

points are connected by straight lines.

The first point and last point are joined to the midpoint of previous and next class respectively.

SCATTER DIAGRAM

Cartograms or Spot Map

show geographical distribution of frequencies of a characteristic.

PICTOGRAM

The pictures representing the value of items are called pictograms.

It is a most useful way of presenting data to people who cannot understand other forms of presentation.

Measures of central tendency: A single estimate of a series of data that summarizes the data is known as a parameter, and one such parameter is the measure of central tendency.

Objective:

to condense the entire mass of data

to facilitate comparison

Fig. --. Height and weight of 20 students of CODS (x-axis: height in feet; y-axis: weight in kg)

Types:

Arithmetic mean- mathematical estimate.

Median - positional estimate.

Mode- based on frequency.

Properties of central tendency:

should be based on each and every item in the series.

should not be affected by extreme observations (either too small or too large values).

should be capable of further statistical computations.

It should have sampling stability, i.e., if different samples of the same size, say 10, are picked from the same population and the measure of central tendency is calculated for each, the results should not differ markedly from one another.

Arithmetic Mean:

simplest measure of central tendency.

Ungrouped data:

$$\text{Mean} = \frac{\text{Sum of all the observations of the data}}{\text{Number of observations in the data}}$$

1. Grouped data with range for class interval:

frequencies in a class interval are equally distributed on either side of the mid point of the class interval.

The formula:

$$\bar{X} = \frac{\sum X_i f_i}{\sum f_i}$$

Where,

$X_i$: midpoint of the class interval

$f_i$: corresponding frequency

$\bar{X}$: mean

2. Grouped data with single value for class interval:

Symbolically,

$$\bar{X} = \frac{\sum X_i f_i}{\sum f_i}$$

Where,

$X_i$: the grouped variable

$f_i$: corresponding frequency
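The grouped-data formula (sum of midpoint times frequency, divided by the sum of frequencies) can be checked with a short calculation. The three classes and their frequencies are hypothetical.

```python
# Hypothetical grouped data: (lower limit, upper limit, frequency) per class.
classes = [(0, 4, 6), (5, 9, 5), (10, 14, 4)]

# X_i is the midpoint of each class interval, f_i its frequency.
midpoints = [(low + high) / 2 for low, high, _ in classes]
frequencies = [f for _, _, f in classes]

sum_xf = sum(x * f for x, f in zip(midpoints, frequencies))  # sum of X_i * f_i
sum_f = sum(frequencies)                                      # sum of f_i
grouped_mean = sum_xf / sum_f

print(midpoints, sum_xf, sum_f, round(grouped_mean, 2))
```

For data grouped by single values, the same code applies with the value itself used in place of the midpoint.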

MEDIAN

middle value in a distribution such that one half of the units in the distribution have a value smaller than or equal to the median and one half have a value higher than or equal to the median.

Calculation of Median:

Ungrouped Data:

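Observations are arranged in order of magnitude; the middle value of the observations is the median.

Odd number of observations: the median is the ((n + 1) / 2)th value

Even number of observations: the median is the mean of the two middle values

Grouped: the median is located at the (total number of observations / 2)th observation

For ungrouped data, the arithmetic mean is written symbolically as:

$$\bar{X} = \frac{\sum X_i}{n}$$

$\Sigma$: sigma, means "the sum of"

$X_i$: the value of each observation in the data

$n$: the number of observations in the data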

MODE

value in a series of observations which occurs with the greatest frequency.

Eg: series on age at eruption of the canine as 6,6,5,7, 8, 6, 7, 5;

6 - mode.

When the mode is ill defined, it can be estimated as:

Mode = 3 Median - 2 Mean.
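The three measures can be computed for the canine eruption series with Python's standard `statistics` module. The last line applies the empirical relation for an ill-defined mode purely as an illustration; in this series the mode is well defined, so the estimate is not actually needed.

```python
import statistics

# Age at eruption of the canine, from the mode example.
ages = [6, 6, 5, 7, 8, 6, 7, 5]

mean_age = statistics.mean(ages)      # sum of observations / number of observations
median_age = statistics.median(ages)  # even n: mean of the two middle values
mode_age = statistics.mode(ages)      # value with the greatest frequency

# Empirical relation used when the mode is ill defined.
estimated_mode = 3 * median_age - 2 * mean_age

print(mean_age, median_age, mode_age, estimated_mode)
```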

Variability & its measures

Types –

Biological variability

Real variability

Experimental variability

Biological variability

Normal or natural differences within accepted biological limits

Individual variability

Periodical variability

Class , group or category variability

Real variability

When the difference b/w two readings is more than the defined limits

Due to the external factors

Experimental variability

Errors or variations due to materials & methods

Observer error – Subjective error

Objective error

Instrument error

Sampling error

Measures of variability

Synonyms:

Measures of dispersion

Measures of variation or scatter

Dispersion is the degree of spread or variation of the variable about a central value.

Uses:

Determine reliability of an average

Serve as a basis of control of variability

Comparison of two or more series

Facilitate further statistical analysis

A good measure of dispersion : simple , easy to compute , based on all items , amenable for further analysis and not affected by extreme values.

Of individual observations -

Range

Interquartile range

Mean deviation

Standard deviation

Coefficient of variation

Variability of samples-

Standard error of mean

Standard error of difference b/w 2 means

Standard error of proportion

Difference b/w 2 proportions

Standard error of correlation coefficient

Standard deviation of regression coefficient

Range

Difference between the value of the smallest item and the value of the largest item.

simplest method.

gives no information about the values that lie between the extreme values.

subjected to fluctuations from sample to sample.

Mean deviation

The average of the absolute deviations from the arithmetic mean

$$\text{M.D} = \frac{\sum |x - \bar{x}|}{n}$$

52,44,54,56,60,64,66,76,60,68

41,54,43,45,60,75,77,66,79,60
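The two series listed above have the same arithmetic mean but different spread, which the mean deviation brings out. A minimal sketch, taking the mean deviation as the average of the absolute deviations from the mean:

```python
series_a = [52, 44, 54, 56, 60, 64, 66, 76, 60, 68]
series_b = [41, 54, 43, 45, 60, 75, 77, 66, 79, 60]


def mean_deviation(values):
    """Average absolute deviation of the values from their arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


mean_a = sum(series_a) / len(series_a)
mean_b = sum(series_b) / len(series_b)
md_a = mean_deviation(series_a)
md_b = mean_deviation(series_b)

print(mean_a, mean_b, md_a, md_b)
```

The absolute value is essential: without it the positive and negative deviations from the mean cancel and always sum to zero.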

Standard deviation:

most important and widely used

it is the square root of the mean of the squared deviations from arithmetic mean.

root mean square deviation

Greater the deviation – greater the dispersion

Smaller the deviation- higher degree of uniformity

Calculation of S.D

For ungrouped data:

Calculate the mean, x̄

Find the difference of each observation from the mean,

d = xi − x̄

Square these = d²

Total these = Σ d²

Divide this by the number of observations minus 1,

variance = Σ d² / (n − 1)

The square root of this variance is the standard deviation,

S.D. = √( Σ d² / (n − 1) )
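The steps above for ungrouped data can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method; the data are the first ten-value series quoted under mean deviation, and the function name is an assumption.

```python
import math

def standard_deviation(values):
    """Sample S.D.: square root of the sum of squared deviations over (n - 1)."""
    n = len(values)
    mean = sum(values) / n                   # step 1: mean
    d = [x - mean for x in values]           # step 2: deviations from the mean
    sum_d2 = sum(di ** 2 for di in d)        # steps 3-4: square and total
    variance = sum_d2 / (n - 1)              # step 5: divide by n - 1
    return math.sqrt(variance)               # step 6: square root

observations = [52, 44, 54, 56, 60, 64, 66, 76, 60, 68]
sd = standard_deviation(observations)
print(round(sd, 2))
```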

For grouped data: with single units for class intervals

Make frequency table

Determine the midpoint of each range

S.D. = √( Σ fi (xi − x̄)² / (n − 1) )

xi – individual observation in the class

x̄ – mean

fi – frequency

n – total frequency

Calculation for grouped data with range for class interval:

Class intervals in terms of range:

Frequencies are taken as centered on the midpoints

S.D. = √( Σ fi (xi − x̄)² / (n − 1) )

xi – midpoint of class interval

x̄ – mean

fi – frequency

n – total frequency
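For grouped data the same calculation is weighted by the class frequencies. The sketch below uses hypothetical class intervals (0-10, 10-20, 20-30, 30-40) and frequencies invented purely for illustration; the names are assumptions.

```python
import math

def grouped_sd(midpoints, frequencies):
    """S.D. for grouped data, treating each frequency as centered on the class midpoint."""
    n = sum(frequencies)
    mean = sum(f * x for x, f in zip(midpoints, frequencies)) / n
    sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, frequencies))
    return mean, math.sqrt(sum_sq / (n - 1))

# Hypothetical frequency table: midpoints of 0-10, 10-20, 20-30, 30-40
midpoints = [5, 15, 25, 35]
frequencies = [2, 5, 8, 5]

grouped_mean, grouped_s = grouped_sd(midpoints, frequencies)
print(grouped_mean, round(grouped_s, 2))
```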

Uses of standard deviation

Summarizes the deviations of a large distribution

Indicates whether the variation from mean is by chance or real

Helps in finding standard error

Helps in finding the suitable size of sample

Standard deviation is only interpretable as a summary measure for variables having approximately symmetric distributions

Coefficient of variation

Compare relative variability

Variation of same character in two or more series

Compare the variability of one character in two different groups having different magnitudes of values, or

compare two characters in the same group, by expressing the variability as a percentage

C.V. = (S.D. / mean) × 100

The higher the C.V., the greater the variability
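As a brief illustration of the coefficient of variation, the sketch below compares two characters measured in different units. The means and standard deviations are hypothetical values chosen only to show the arithmetic.

```python
def coefficient_of_variation(sd, mean):
    """C.V. = (S.D. / mean) x 100, expressed as a percentage."""
    return sd / mean * 100

# Hypothetical figures for two characters in the same group
cv_height = coefficient_of_variation(sd=8.0, mean=160.0)   # height in cm
cv_weight = coefficient_of_variation(sd=6.0, mean=60.0)    # weight in kg

print(cv_height, cv_weight)
```

Although the S.D. of height is larger in absolute terms, weight shows the greater relative variability in this made-up example.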

Normal distribution & Normal curve

Height of bars or curve greatest in middle

Values are spread around mean

Maximum values around mean , few at extremes

half values above & half below mean

Properties of the normal Distribution

The curve is bell shaped.

The curve is symmetrical about the middle point.

The mean is located at the highest point of the curve.

The measures of central tendency (mean, median and mode) coincide.

The maximum number of observations is at the value of the variable corresponding to the mean.

The number of observations gradually decreases on either side, with very few observations at the extreme points.

The area under the curve between any two points, which corresponds to the number of observations between any two values of the variate, can be expressed in terms of the mean and the S.D.:

a) Mean ±1 S.D. covers 68.3% of the observations;

b) Mean ±2 S.D. covers 95.4% of the observations;

c) Mean ±3 S.D. covers 99.7% of the observations.

This relationship is used for fixing confidence interval.

Normal distribution law forms the basis for various tests of significance
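The percentages quoted above for mean ± 1, 2 and 3 S.D. can be checked numerically. The sketch below uses the error function from Python's standard math module; the function name coverage_within is an assumption made for illustration.

```python
import math

def coverage_within(k):
    """Proportion of a normal distribution lying within mean +/- k S.D."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(coverage_within(k) * 100, 1))
```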

Relative or standard normal deviate

Deviation from mean in normal distribution

Measured in terms of S.D

Indicates how much an observation is bigger or smaller than the mean, in units of S.D.

Z = (observation − mean) / S.D.

Z = (x − x̄) / S
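A small sketch of the standard normal deviate follows. The height figures are hypothetical and serve only to show how an observation is expressed in units of S.D.

```python
def z_score(observation, mean, sd):
    """Relative (standard normal) deviate: distance from the mean in S.D. units."""
    return (observation - mean) / sd

# Hypothetical example: mean height 160 cm, S.D. 5 cm, observed height 170 cm
z = z_score(observation=170.0, mean=160.0, sd=5.0)
print(z)
```

A value of Z = 2 means the observation lies two standard deviations above the mean.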

Probability or chance

The relative frequency with which an event is expected to occur on average

Expressed as ‘p’

Ranges from 0 to 1

When p = 0, there is no chance of the event happening

When p = 1, there is a 100% chance of the event happening

p = number of events occurring / total number of trials

Statistical hypothesis

Methods to test the difference between sample estimates

Two hypotheses are made:

Null hypothesis or hypothesis of no difference

Alternative hypothesis of significant difference

Null hypothesis or hypothesis of no difference [Ho]

Asserts that there is no real difference between the sample & the general population

The difference found is accidental & arises out of sampling variations

Alternative hypothesis of significant difference [H1]

States that the sample result is different from the hypothetical value of the population

To minimize errors the sampling distribution or area under the normal curve is divided into two regions or zones

1. Zone of acceptance: if the sample estimate falls within mean ± 1.96 SE, the null hypothesis is accepted

2. Zone of rejection: if the sample estimate falls beyond mean ± 1.96 SE, the null hypothesis is rejected

Degrees of freedom

The quantity in the denominator, which is one less than the number of independent observations in the sample.

Eg:

When there are 10 values, there are 9 choices or degrees of freedom

In the unpaired t-test of the difference between 2 means,

df = n1 + n2 − 2

where n1 & n2 are the numbers of observations in the two samples.

In the paired t-test, df = n − 1

Standard error

A measure of the variability of the sample mean

Obtained by dividing the SD by the square root of the sample size

SE = SD / √n
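The standard error formula, together with the mean ± 1.96 SE limits used for the zone of acceptance, can be sketched as follows. The sample mean, SD and size are hypothetical values chosen for easy arithmetic.

```python
import math

def standard_error(sd, n):
    """SE of the mean = SD / sqrt(n)."""
    return sd / math.sqrt(n)

# Hypothetical sample: mean 100, SD 10, n = 25
sample_mean, sample_sd, sample_size = 100.0, 10.0, 25

se = standard_error(sample_sd, sample_size)
lower = sample_mean - 1.96 * se   # lower limit of mean +/- 1.96 SE
upper = sample_mean + 1.96 * se   # upper limit of mean +/- 1.96 SE
print(se, lower, upper)
```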

2 types of errors;

Type 1 error

Type 2 error

Type I error (α error):

– the null hypothesis is rejected when it is in fact true

– the null hypothesis is rejected even though the sample estimate falls in the zone of acceptance

– serious error.

Type II error (β error):

– the null hypothesis is wrongly accepted when it is in fact false

– the null hypothesis is accepted even though the sample estimate falls in the zone of rejection

– not a serious error; needs only confirmation of the result by changing the level of significance

                            Accept it           Reject it
Null hypothesis is true     Correct decision    Type I error
Null hypothesis is false    Type II error       Correct decision

Tests of significance

Parametric and non parametric tests or methods

Parametric methods

Methods of statistical inference that are based on the assumption that the population has a certain probability distribution; the resulting collection of statistical tests and procedures is referred to as parametric methods. For example, the t-distribution and F-distribution are associated with the values of parameters of an assumed normal probability distribution.

Non parametric methods

Statistical procedures that do not require any assumption about the form of the probability distribution from which the observations come are known as non parametric methods. These are also called distribution free methods. For example, chi square frequency techniques are non parametric.

Parametric tests

Eg. t test, Z test, Chi-square test (for population variance), Pearson correlation coefficient

Non parametric tests

Eg. Chi-square test, Kruskal-Wallis test, Spearman correlation coefficient

Tests of significance- Steps involved

Define the problem

state the hypothesis

Null hypothesis

Alternate hypothesis

Fix the level of significance

Select appropriate test to find test statistic

Find degree of freedom (df)

Compare the observed test statistic with theoretical one at desired level of significance & corresponding DF

If the observed test statistic value is greater than the theoretical value, reject the null hypothesis.

Draw the inference based on the level of significance

Objective of using tests of significance

To compare –

Sample mean with population mean

Means of two samples

Sample proportion with population proportion

Proportions of two samples

Association between two attributes

t - test

Student’s t-test

Designed by W.S. Gosset

Unpaired t-test (two independent samples)

Paired t-test (single sample, correlated observations)

Essential conditions:

randomly selected samples from the corresponding populations

Homogeneity of variances in the 2 samples

Quantitative data

Variable normally distributed

Sample size < 30

Unpaired t- test

Applied to unpaired data of independent observations made on individuals of two different or separate groups or samples drawn from 2 populations

The null hypothesis is stated

Find the difference between the means of the two samples

(x̄1 − x̄2), which measures the variation in the variable

Calculate the t value

t = (x̄1 − x̄2) / SE

where SE is the standard error of the difference between the two means
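The unpaired t value can be sketched as below. The notes give only t = difference of means / SE; the pooled-variance form of SE used here is an assumption, chosen because it matches the stated conditions (homogeneity of variances, df = n1 + n2 − 2). The two groups are hypothetical data.

```python
import math

def unpaired_t(sample1, sample2):
    """Unpaired t-test statistic using a pooled estimate of variance (assumed form of SE)."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    ss1 = sum((x - mean1) ** 2 for x in sample1)   # sum of squares, sample 1
    ss2 = sum((x - mean2) ** 2 for x in sample2)   # sum of squares, sample 2
    df = n1 + n2 - 2
    pooled_variance = (ss1 + ss2) / df
    se_diff = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, df

# Hypothetical observations from two independent groups
group1 = [10, 12, 14, 16, 18]
group2 = [8, 9, 10, 11, 12]

t_value, df = unpaired_t(group1, group2)
print(round(t_value, 3), df)
```

The calculated t is then compared with the table value at the chosen level of significance and the corresponding degrees of freedom.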

Paired t- test

To study the role of a factor or cause when the observations are made before & after its play:

Eg: exertion on pulse rate, effect of a drug on blood pressure etc

To compare the effect of 2 drugs, given to the same individual in the sample on two different occasions

eg: adrenaline & noradrenaline on pulse rate

To study the comparative accuracy of 2 different instruments

eg: 2 different types of sphygmomanometers

To compare the results of 2 different lab techniques

To compare the observations made at two different sites in the same body

Testing procedure:

State the null hypothesis

Find the difference in each pair of observations, X1 − X2 = x

Calculate the mean of the differences, x̄ = Σ x / n

Calculate the SD of the differences & the SE of the mean

SE = SD / √n

Determine the t value

t = (x̄ − 0) / (SD / √n)

• Find the degrees of freedom, n − 1

• Refer to the table & find the probability

• P > 0.05: not significant

• P < 0.05: significant
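The paired testing procedure above can be sketched as follows. The pulse-rate readings are hypothetical and the function name is only illustrative; the differences are taken as X1 − X2, as in the procedure.

```python
import math

def paired_t(x1, x2):
    """Paired t-test statistic from the differences x = X1 - X2."""
    diffs = [a - b for a, b in zip(x1, x2)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    sd = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    se = sd / math.sqrt(n)
    return (mean_diff - 0) / se, n - 1   # t value and degrees of freedom

# Hypothetical pulse rates of the same 5 individuals after and before exertion
after = [78, 80, 74, 85, 79]
before = [72, 75, 70, 78, 74]

t_value, df = paired_t(after, before)
print(round(t_value, 2), df)
```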

Variance ratio test or F test

Variance: a measure of the extent of the variation present in a set of data

Obtained by dividing the sum of squares of the deviations by the degrees of freedom

Measured in squared units

Comparison of variance between two samples

Test developed by Fisher & Snedecor

Involves another distribution called F – distribution

Calculate the variances of the two samples first, S1² & S2²

(Variance = SD²)

F = S1² / S2²

where S1² > S2², i.e. the larger variance, S1², is the numerator

The significance of F is found by referring to the F-table

Degrees of freedom: (n1 − 1) & (n2 − 1) in the two samples

The table gives variance ratio values at different levels of significance, with df (n1 − 1) given horizontally and (n2 − 1) vertically

Eg. Sample A: sum of squares = 36; df = 8

Sample B: sum of squares = 42; df = 9

F = (42/9) / (36/8) = (42/9) × (8/36) = 1.04

This value of F < table value at p = 0.05, so the difference is not significant
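The worked example above can be reproduced with a short sketch. The function name variance_ratio is an assumption; the sums of squares and degrees of freedom are those given for samples A and B.

```python
def variance_ratio(ss_a, df_a, ss_b, df_b):
    """F = larger variance / smaller variance, each variance = sum of squares / df."""
    var_a = ss_a / df_a
    var_b = ss_b / df_b
    return max(var_a, var_b) / min(var_a, var_b)

# Sample A: sum of squares 36, df 8; Sample B: sum of squares 42, df 9
f_value = variance_ratio(36, 8, 42, 9)
print(round(f_value, 2))
```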

Analysis of variance(ANOVA) test

Compare more than two samples

Compares variation between the classes as well as within the classes

For such comparisons there is high chance of error using t or Z test

Variation in experimental studies – natural variation/ random / error variation

Variation caused due to experimenter- imposed variation or treatment variation

A: Between groups variation = random variation (always) + imposed variation (maybe)

B: Within group variation = random variation

Total variation = A + B

If there is no real difference between the groups, then

between treatment variation / within treatment variation = random variation / random variation ≈ 1

If there is any real difference between the treatments, then

between treatment variation / within treatment variation = (random variation + imposed variation) / random variation > 1
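The ratio of between-treatment to within-treatment variation can be sketched for a one-way layout as below. The three groups are hypothetical data, and the mean-square form (sum of squares divided by its degrees of freedom) is the usual way of putting the two variations on a comparable footing.

```python
def one_way_anova_f(groups):
    """F = between-groups mean square / within-groups mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between groups: random variation + any imposed (treatment) variation
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within groups: random variation only
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical observations under three treatments
treatments = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
anova_f = one_way_anova_f(treatments)
print(anova_f)
```

A ratio well above 1, as in this made-up example, suggests imposed variation in addition to random variation; its significance is judged from the F-table.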

Chi square test ( χ ² test )

Non parametric test

Developed by Karl Pearson

Not based on normal distribution of any variable

Used for qualitative data

To test whether the difference in distribution of attributes in different groups is due to sampling variation or otherwise.

Applications

1. Test for goodness of fit

2. Test of association (independence)

3. Test of homogeneity or population variance

The χ² test is non parametric in the first two cases and parametric in the third case

Calculation of χ² value

Three requirements –

A random sample

Qualitative data

Lowest expected frequency not less than 5

χ² = Σ (observed f − expected f)² / expected f

Expected f = (row total × column total) / grand total

df = (r − 1) × (c − 1)

The calculated value is compared with the table value
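The calculation of χ² from a contingency table can be sketched as follows. The 2 × 2 table is hypothetical, and the function name chi_square is only illustrative.

```python
def chi_square(observed):
    """Chi-square statistic and df for an r x c table of observed frequencies."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Hypothetical 2 x 2 table: rows = groups, columns = attribute present / absent
table = [[20, 30],
         [30, 20]]

chi2_value, chi2_df = chi_square(table)
print(chi2_value, chi2_df)
```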

Drawbacks :

Tells us about the association but fails to measure the strength of association.

Test is unreliable if the expected frequency in any one cell is less than 5.

Correction is done by subtracting 0.5 from |O − E|: Yates's correction

For tables larger than 2 × 2, Yates's correction cannot be applied

Not applicable when there is 0 or 1 in any of the cells [resort to Fisher's exact probability test]

χ² values should be interpreted with caution when the sample size is < 50
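Yates's correction for a 2 × 2 table, mentioned among the drawbacks above, can be sketched as below. The table is hypothetical; the code simply subtracts 0.5 from each |O − E| before squaring, as described in the notes.

```python
def chi_square_yates(observed):
    """Chi-square for a 2 x 2 table with 0.5 subtracted from each |O - E|."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (abs(obs - expected) - 0.5) ** 2 / expected
    return chi2

# Hypothetical 2 x 2 table of observed frequencies
table_2x2 = [[20, 30],
             [30, 20]]

corrected = chi_square_yates(table_2x2)
print(round(corrected, 2))
```

For this made-up table the corrected value is smaller than the uncorrected χ², which illustrates that the correction makes the test more conservative.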

Non parametric tests:

A family of statistical tests, also called distribution free tests, that do not require any assumption about the distribution the data set follows and that do not require the testing of distribution parameters such as means or variances.

Friedman’s test – nonparametric equivalent of repeated-measures analysis of variance

Kruskal–Wallis test – compares medians of several independent samples; equivalent of one-way analysis of variance

Mann–Whitney U test – compares medians of two independent samples; equivalent of the unpaired t test

McNemar’s test – variant of the chi squared test, used when the data are paired

Wilcoxon’s signed rank test – for paired data; equivalent of the paired t test

Spearman’s rank correlation – nonparametric correlation coefficient

CONCLUSION

It is more important to understand the indications and limitations of the various statistical tests than the detailed mathematical calculations, since the latter are taken care of by software such as SPSS.

Understanding the classification of data is crucial for the selection of an appropriate test of significance.

REFERENCES

B.K. Mahajan. Methods in Biostatistics, 6th edition, Jaypee Brothers.

P.S.S. Sundar Rao, J. Richard. An Introduction to Biostatistics, 3rd edition, Prentice Hall of India.

James F. Jekel, David L. Katz, Joann G. Elmore. Epidemiology, Biostatistics and Preventive Medicine, 2nd edition, WB Saunders Company.

C.R. Kothari. Research Methodology.

Soben Peter. Preventive and Community Dentistry, 4th edition.
