Applied Biostatistics - bhumipublishing.com

Applied Biostatistics:

An Essential Tool in Healthcare Profession (ISBN: 978-81-953600-1-7)

Ms. Archana V. Nerpagar

Assistant Lecturer

Bhonsala Military School, Nashik

Mr. Umesh D. Laddha

Assistant Professor,

MET’s Institute of Pharmacy, Nasik

Mrs. Savita Mandan

Assistant Professor,

R. C. Patel Institute of Pharmaceutical

Education and Research, Shirpur

Dr. Sanjay J. Surana

Principal,

R. C. Patel Institute of Pharmaceutical

Education and Research, Shirpur

Dr. Sanjay J. Kshirsagar

Principal,

MET’s Institute of Pharmacy, Nasik

2021

First Edition: 2021

ISBN: 978-81-953600-1-7

Copyright reserved by the publishers

Publication, Distribution and Promotion Rights reserved by Bhumi Publishing, Nigave Khalasa, Kolhapur

Despite every effort, there may still be chances for some errors and omissions to have crept in

inadvertently.

No part of this publication may be reproduced in any form or by any means, electronically, mechanically,

by photocopying, recording or otherwise, without the prior permission of the publishers.

The views and results expressed in various articles are those of the authors and not of editors or

publisher of the book.

Published by:

Bhumi Publishing,

Nigave Khalasa, Kolhapur 416207, Maharashtra, India

Website: www.bhumipublishing.com

E-mail: [email protected]

Book Available online at:

https://www.bhumipublishing.com/books/

Preface

We have great pleasure and privilege in presenting the book “Applied Biostatistics: An Essential Tool in Healthcare Profession”, which is based on the new PCI syllabus and will be helpful in the B. Pharm, M. Pharm, B. Sc., M. Sc. and B. Ed. courses.

This book accentuates the relationships among probability, probability distributions and hypothesis testing. To explain the methodology of hypothesis testing, the concepts of the null and research hypotheses are highlighted. The book also contains an in-depth study of the standard parametric analyses along with their nonparametric alternatives. Nonparametric techniques are especially useful in research activities with small sample sizes.

In this book every topic is explained in detail and supported by sufficient solved examples. The questions are categorised according to the types of methods applied. We have tried to keep the language as simple as possible so that students can understand the statistical concepts more easily. We have also tried to cover the material in enough depth that readers will benefit when preparing for competitive exams.

We hope that this book will be appreciated and accepted by all institutes, teachers and students. There may be a few mistakes and deficiencies; we will be grateful if readers point them out to us. We also welcome any suggestions.

Acknowledgment

First and foremost, praises and thanks to the God, the Almighty, for His showers of

blessings throughout my work to complete this book successfully.

I would like to express my deep and sincere thanks to Dr. Sanjay J. Surana, Principal,

R. C. Patel Institute of Pharmaceutical Education and Research, Shirpur and Shirpur

Education Society (SES), Shirpur for giving me opportunity and to make me able to

write this book.

I wish to express my special gratitude to my soul mate Mr. Umesh D. Laddha whose

dynamism, vision, sincerity and motivation have deeply inspired me for the

completion of this book.

I am also thankful to Mrs. Savita S. Mandan from R. C. Patel Institute of

Pharmaceutical Education and Research, Shirpur for her valuable addition.

I am also thankful to Central Hindu Military Education Society, Nashik for providing

facilities while writing this book.

I am extremely grateful to my parents for their love, prayers, caring and sacrifices for

educating and preparing me for my future. I am very much thankful to my Son

Devesh for his love, understanding and continuing support to complete my work. Also

I express my thanks to my sisters, brother and in laws for their support and valuable

time.

- Archana V. Nerpagar

This book is dedicated to

Hard work, Patience and Efforts….

And

To my lovely Son Devesh

Index

Sr. No. Topic

1. Introduction to biostatistics
2. Probability
3. Sample and sampling techniques
4. Correlation
5. Regression
6. Sampling Variability, Significance & Statistical inference
7. Testing of hypothesis
8. ANOVA
9. Chi-square test
10. Non-parametric test
11. Experimental design
12. Applications of Biostatistics in Pharmacy
13. References
14. Standard Value Tables

1. INTRODUCTION TO BIOSTATISTICS

Introduction:

Statistics is a very broad subject, with applications in a vast number of different fields. Statistics, or statistical analysis, is the branch of mathematics which deals with the collection, organization, analysis, interpretation and presentation of data. In other words, statistics is the methodology which scientists and mathematicians have developed for interpreting and drawing conclusions from collected data. It is the science of gaining information from numerical and categorical data. In practice, statistics is applied successfully to study the effectiveness of medical treatments, the reaction of consumers to television advertising, the attitude of young people toward marriage, and much more. It is safe to say that nowadays statistics is used in every field of science.

Biostatistics is defined as the application of statistical tools and methods to data derived from the biological sciences. It includes the application of statistics in the development and use of therapeutic drugs and devices in humans and animals. The science of biostatistics covers biological experiments (specifically in medicine, pharmacy, agriculture and fishery), the collection, analysis and interpretation of data, and the inferences and results drawn from them. The goal of biostatistics is to promote statistical science and its application in the study of medicine, human health and disease.

Let us first define some basic terms of statistics that are necessary for understanding biological and agricultural analysis.

1. Statistical Data:

A collection of numerical statements of facts is called data. In statistical analysis, this numerical data is obtained from scientific enquiry. The numerical facts in the collected scientific data are known as observations.

In statistics, data describe characteristics. A characteristic is a quality possessed by an individual or observation.

Characteristics are of two types:

(i) Non-measurable characteristics (attributes): The characteristics related to the qualities of the observations are called attributes. E.g. sex, literacy, blood group, pass, fail etc.

(ii) Measurable characteristics (variables): The characteristics related to the quantity of the observations are called variables. Their values vary from one observation to another. E.g. heights, weights and ages of persons, temperature, water salinity, etc.

For example, the weights of children in a class are 35 kg, 37 kg, 32 kg, 38 kg, 34 kg, 39 kg, 36 kg and 40 kg. This statistical statement contains numerical values, which form the data for analysis, with 8 observations.

Type of Data:

Data can be classified into two types:

a) Qualitative data:

The data which deals with the descriptions or qualities of individuals is qualitative data.

This data can be observed but cannot be measured using any unit. The qualitative characteristics

are known as attributes.

e.g. blood group, colours, smells, tastes, appearance, emotions etc.

b) Quantitative data:

The data which deals with the numerical values of the individuals is quantitative data. This

data can be measured in some units. The quantifiable characteristics are known as variables.

e.g. height, weight, length, temperature etc.

The quantitative variables are further divided as follows:

i) Discrete variables: Variables that can take only specific, separate values in a given range are known as discrete variables. Their values can be counted in a finite amount of time.

For example, we can count the change in our pocket, the money in a bank account, the number of tablets in a pack, the number of students in a class, parity, or the number of myocardial infarctions.

ii) Continuous variables: Variables that can take an infinite number of values in a given range are known as continuous variables. Their possible values cannot be listed by counting: however finely we measure, there is always another value in between. Many of the variables studied in biology are continuous variables.

For example, age. We cannot state an exact age, because it can always be measured more finely: the age of a person could be given as 25 years, 5 months, 10 days, 6 hours, 40 minutes, 4 seconds, 4 milliseconds, 9 nanoseconds, 10 picoseconds, and so on. Other examples are weight, diastolic blood pressure, volume and time required for recovery.

Collection of Data:

Data collection is an important aspect of any type of research study. Inaccurate data collection can affect the results of a study and ultimately lead to invalid conclusions. Data can be collected on the basis of qualitative or quantitative characteristics. For example, to check the effect of a drug in curing a disease, we have to collect quantitative information about patients before and after administration of the drug.

There are many methods of data collection, depending on the research design and methodology. Generally, data is collected from two sources: primary sources and secondary sources.

Primary sources: The original, first-hand source from which information is collected is called a primary source, and the data collected from a primary source is primary data. That is, when an investigator collects data personally, with a definite plan or purpose in mind, it is called primary data. E.g. data obtained by the census commissioner for a population census.

The following methods are used to collect primary data:

a) Observation method: Observation is the main source of information in the field of research. In this method observations are recorded from experiments or from a specific situation.

b) Questionnaire method: This method plays an important role in the data collection process. A questionnaire usually consists of a number of objective questions that the respondent has to answer.

The questionnaire should be designed properly. All questions asked should be relevant to the subject of research. Questions should be short, simple, clear and easy to understand. They should be arranged in order from easy to difficult. Information can be collected through a questionnaire sent by mail or post, which is called postal inquiry.

c) Interview method: An interview is a verbal conversation between two people in order to collect the information required for research. It is an interactional communication in which the interviewer asks questions for a specific purpose, to obtain research-related information, and the interviewee answers them. There are different types of interview, such as personal interview, telephone interview, focus group interview, depth interview and projective techniques.

Secondary sources: Sources of information such as published literature or published reports are known as secondary sources, and data collected from secondary sources is secondary data. That is, data which is not collected originally by the investigator but obtained from published or unpublished sources is called secondary data. Some of the secondary sources are as follows:

1) Government publications, 2) Census reports, 3) Periodicals and books, 4) Research review journals, 5) Research articles, 6) Research papers, 7) Magazines, 8) Academic publications, 9) Research literature, 10) Ph.D. theses.

Classification of Data:

The process of arranging collected data into homogeneous groups or classes according to their common characteristics is called classification. After collecting qualitative or quantitative data, it is required to sort the data (for example, from questionnaires) according to common characteristics. Proper classification drops the unnecessary information.

e.g. During a population census, people in the country are classified according to sex (males/females), marital status (married/unmarried), residential place (rural/urban), age groups, profession, etc.

Raw Data: Information that has been collected and presented as it is, without any arrangement, is called raw data.

For example: Given below are the marks (out of 25) obtained by 20 students of class VII A in a mathematics test.

18, 16, 12, 10, 5, 5, 4, 19, 20, 10, 12, 12, 15, 15, 15, 8, 8, 8, 8, 16

Observation:

Each entry collected as a numerical fact in the given data is called an observation.

Array:

The raw data when put in ascending or descending order of magnitude is called an array or

arrayed data.

For Example: The above data is arranged in ascending order and represented as:

4, 5, 5, 8, 8, 8, 8, 10, 10, 12, 12, 12, 15, 15, 15, 16, 16, 18, 19, 20

Range:

The difference between the highest and the lowest value of the observation is called the

range of the data.

In the above data,

Highest marks obtained = 20

Lowest marks obtained = 4

Therefore, range = 20 − 4 = 16
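The array and the range defined above can be checked with a few lines of Python. This is only a minimal sketch; the variable names `marks`, `array` and `data_range` are illustrative and do not come from the text.

```python
# Marks (out of 25) of the 20 students, copied from the raw data above.
marks = [18, 16, 12, 10, 5, 5, 4, 19, 20, 10,
         12, 12, 15, 15, 15, 8, 8, 8, 8, 16]

# Array: the raw data arranged in ascending order of magnitude.
array = sorted(marks)

# Range: highest observation minus lowest observation.
data_range = max(marks) - min(marks)

print(array)
print("Range =", data_range)
```

Sorting in descending order (`sorted(marks, reverse=True)`) would give an equally valid array.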

Frequency Distribution:

a) Frequency: The number of times an observation occurs in a given series of observations is called the frequency of that observation.

e.g. consider the marks of 15 students in a class as follows: 22, 24, 20, 22, 23, 20, 25, 22, 22, 25, 20, 25, 20, 22, 24.

Here, the number 20 occurs 4 times, so the frequency of 20 is 4.

22 occurs 5 times, so its frequency is 5.

Similarly, the frequency of 23 is 1, the frequency of 24 is 2 and the frequency of 25 is 3.
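As a rough illustration, the frequencies in this example can be counted with `collections.Counter` from the Python standard library. The variable names are illustrative only.

```python
from collections import Counter

# Marks of the 15 students from the example above.
marks = [22, 24, 20, 22, 23, 20, 25, 22, 22, 25, 20, 25, 20, 22, 24]

# Counter maps each distinct observation to the number of times it occurs.
frequency = Counter(marks)

for value in sorted(frequency):
    print(value, frequency[value])
```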

b) Frequency distribution: The tabular arrangement of observations in the collected scientific

data individually or in groups or classes along with their frequencies is called frequency

distribution.

c) Class: The group of observations in the data under our consideration is called class.

e.g. the marks out of 100 can be divided into the classes as 0-10, 10-20, 20-30, …, 90-100.

Classes are also known as class intervals. Each class interval is assigned two values. The

smallest value is called lower limit and the highest value is called upper limit of certain class

interval.

e.g. For a class 40-50,

Lower limit = 40 and upper limit = 50.

Classes can be of two types:

i) Continuous classes: The classes of the form 10-20, 20-30, 30-40,… in which lower limit of any

class is equal to upper limit of its previous class are called continuous classes.

ii) Non-continuous classes: The classes of the form 10-19, 20-29, 30-39, … are called non-continuous classes.

These classes can be made continuous by subtracting $d/2$ from the lower limit of every class and adding $d/2$ to the upper limit of every class, where $d$ is the difference between the lower limit of any class and the upper limit of its previous class. The newly formed class intervals are continuous and are known as class boundaries.

Ex. Make the following non-continuous classes continuous.

Class      20-29  30-39  40-49  50-59  60-69
Frequency  5      8      12     7      6

Ans: Here, the common difference is

$$d = 30 - 29 = 40 - 39 = 50 - 49 = 60 - 59 = 1, \qquad \frac{d}{2} = \frac{1}{2} = 0.5$$

So subtract 0.5 from all lower limits and add 0.5 to all upper limits of the classes.

Class   Frequency  Class Boundaries
20-29   5          19.5-29.5
30-39   8          29.5-39.5
40-49   12         39.5-49.5
50-59   7          49.5-59.5
60-69   6          59.5-69.5

d) Class width (h): The difference between the upper limit and the lower limit of a class interval is called the class width. It is denoted by $h$.

$$h = \text{upper limit} - \text{lower limit}$$

e.g. for the class interval 45-55, $h = 55 - 45 = 10$.

e) Class mark or mid value (X): The class mark or mid value is the value which lies exactly in the middle of the class interval. It is denoted by $X$ and given by

$$X = \frac{\text{lower limit} + \text{upper limit}}{2}$$

e.g. for the class interval 30-40, $X = \frac{30 + 40}{2} = 35$.

f) Relative frequency: It is given by the formula

$$\text{Relative frequency of a class} = \frac{\text{Frequency of the class}}{\text{Total frequency}}$$

g) Percentage frequency:

$$\text{Percentage frequency of a class} = \frac{\text{Frequency of the class}}{\text{Total frequency}} \times 100$$

h) Frequency density: If the class intervals of a frequency distribution are of unequal width, the frequency densities can be used to compare the concentration of frequencies in the class intervals and to construct a histogram.

$$\text{Frequency density of a class} = \frac{\text{Frequency of the class}}{\text{Class width}}$$
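The definitions of class boundaries, class width, class mark and frequency density can be tried out on the classes 20-29, …, 60-69 from the worked example. The following is a minimal sketch, not a prescribed procedure. It assumes that the class width is measured on the continuous class boundaries (so that it equals 10 rather than 9), and all variable names are illustrative.

```python
# Non-continuous classes and their frequencies from the worked example.
classes = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]
frequencies = [5, 8, 12, 7, 6]

# d = lower limit of a class minus upper limit of the previous class (30 - 29).
d = classes[1][0] - classes[0][1]

rows = []
for (lower, upper), f in zip(classes, frequencies):
    lower_b = lower - d / 2            # lower class boundary
    upper_b = upper + d / 2            # upper class boundary
    width = upper_b - lower_b          # class width h (assumption: taken on the boundaries)
    mark = (lower_b + upper_b) / 2     # class mark X (mid value)
    density = f / width                # frequency density
    rows.append((lower_b, upper_b, width, mark, density))
    print(f"{lower}-{upper}: boundaries {lower_b}-{upper_b}, "
          f"h = {width}, X = {mark}, density = {density}")
```

The class mark is the same whether it is computed from the class limits or from the class boundaries, because the same amount d/2 is subtracted at one end and added at the other.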

There are two types of frequency distribution.

1) Discrete (ungrouped) frequency distribution:

In a discrete frequency distribution, the distinct observations are listed in ascending order in the first column of the table, each value appearing only once. A further column of the table contains the frequencies of the corresponding observations.

Ex. The following data gives the number of members in 30 families. Classify the data and prepare a frequency distribution table.

3 3 4 2 4 3 5 6 2 4
3 4 1 6 3 2 7 6 1 1
5 5 3 2 1 3 1 5 4 3

Observations  Tally Marks  Frequency
1             |||||        5
2             ||||         4
3             ||||| |||    8
4             |||||        5
5             ||||         4
6             |||          3
7             |            1
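A discrete frequency table with tally marks can also be produced programmatically. The sketch below is only one possible way of doing it; the helper `tally` and the choice to write each group of five as five plain strokes are assumptions made for illustration.

```python
from collections import Counter

# Number of members in the 30 families from the example above.
family_sizes = [
    3, 3, 4, 2, 4, 3, 5, 6, 2, 4,
    3, 4, 1, 6, 3, 2, 7, 6, 1, 1,
    5, 5, 3, 2, 1, 3, 1, 5, 4, 3,
]

def tally(count):
    """Render a count as tally strokes, written in groups of five."""
    groups, rest = divmod(count, 5)
    parts = ["|||||"] * groups
    if rest:
        parts.append("|" * rest)
    return " ".join(parts)

# Distinct observations in ascending order with their frequencies.
table = sorted(Counter(family_sizes).items())

print(f"{'Observations':<14}{'Tally Marks':<13}Frequency")
for value, freq in table:
    print(f"{value:<14}{tally(freq):<13}{freq}")
```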

2) Continuous (grouped) frequency distribution:

Steps:

1) Find the maximum and minimum values in the given data.

2) Decide the number of class intervals to be formed.

3) Form the classes in such a way that the least value is included in the first class interval and the maximum value is included in the last class interval.

4) If the classes are continuous, the upper limit of a class is counted in the next class, i.e. for the class interval 15 – 20, the value 20 falls in the next class interval.

Example:

Consider the following marks (out of 50) obtained in Mathematics by 60 students of Class VIII:

21, 10, 30, 22, 33, 5 , 37, 12, 25, 42, 15, 39, 26, 32, 18, 27, 28, 19, 29, 35, 31, 24,36, 18, 20, 38, 22, 44,

16, 24, 10, 27, 39, 28, 49, 29, 32, 23, 31, 21, 34, 22, 23, 36, 24, 36, 33, 47, 48, 50 , 39, 20, 7, 16, 36, 45,

47, 30, 22, 17.

If we made a frequency distribution table with one row for each observation, the table would be too long, so, for convenience, we make groups of observations.

In the above data, 5 is the minimum value and 50 is the maximum value. So we have to form classes such that the first class includes the minimum value and the last class includes the maximum value. The class intervals will therefore be 0 – 10, 10 – 20 and so on; because the upper limit of a class is counted in the next class, the value 50 falls in the class 50 – 60. Find the frequency of each class.

The frequency distribution table is as follows:

Groups   Tally Marks                  Frequency
0-10     ||                           2
10-20    ||||| |||||                  10
20-30    ||||| ||||| ||||| ||||| |    21
30-40    ||||| ||||| ||||| ||||       19
40-50    ||||| ||                     7
50-60    |                            1
Total                                 60
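The grouping rule used here (the lower limit belongs to the class, the upper limit is counted in the next class) can be sketched as follows. The class width of 10 and the classes 0-10 to 50-60 are taken from the table; the variable names are illustrative.

```python
# Marks (out of 50) of the 60 students of Class VIII from the example.
marks = [
    21, 10, 30, 22, 33, 5, 37, 12, 25, 42, 15, 39, 26, 32, 18, 27, 28, 19, 29, 35,
    31, 24, 36, 18, 20, 38, 22, 44, 16, 24, 10, 27, 39, 28, 49, 29, 32, 23, 31, 21,
    34, 22, 23, 36, 24, 36, 33, 47, 48, 50, 39, 20, 7, 16, 36, 45, 47, 30, 22, 17,
]

class_width = 10
frequencies = {}
for lower in range(0, 60, class_width):        # lower limits 0, 10, ..., 50
    upper = lower + class_width
    # lower <= m < upper: the upper limit itself is counted in the next class.
    frequencies[f"{lower}-{upper}"] = sum(lower <= m < upper for m in marks)

for group, f in frequencies.items():
    print(group, f)
print("Total", sum(frequencies.values()))
```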

Ex.1. The following data gives the marks obtained by 50 students in Mathematics. Prepare a grouped frequency distribution table taking the class intervals 20-24, 25-29, 30-34, etc.

21 20 55 39 48 46 36 54 42 30

29 42 32 40 34 31 35 37 52 44

39 45 37 33 51 53 52 46 43 47

41 26 52 48 25 34 37 33 36 27

54 36 41 33 23 39 28 44 45 38

Class 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59

Frequency 3 5 8 11 8 7 7 1

Note: Above class intervals are not continuous. But if the class intervals are continuous then the

upper limit value is not included in that corresponding class. See the example given below.

Ex.2. Prepare a grouped frequency distribution table from following data. Take the classes 30-55,

55-80, etc.

110 175 161 157 155 108 164 128 114 178

165 133 195 151 71 94 97 42 30 62

138 156 167 124 164 146 116 149 104 141

103 150 162 149 79 113 69 121 93 143

140 144 187 184 197 87 40 122 103 148

Classes    30-55  55-80  80-105  105-130  130-155  155-180  180-205
Frequency  3      4      7       9        12       11       4

Here, the observation 155 is included in the class 155-180 and not in 130-155, where it is the upper limit.

Frequencies can also be presented as cumulative frequencies.

Cumulative frequency (c.f.): The successive addition of frequencies in the table is known as cumulative frequency. It is calculated by adding each frequency from a frequency distribution table to the sum of its predecessors. Cumulative frequency is used to determine the number of observations that lie above (or below) a particular value in a data set.

There are two types of cumulative frequency:

a) Cumulative Frequency Less Than Type (c.f.l.t.t.):

The less than type cumulative frequency of a class is the sum of the frequency of that class and the frequencies of all previous classes. The addition is carried out from top to bottom, i.e. from the lowest class to the highest class.

b) Cumulative Frequency More Than Type (c.f.m.t.t.):

The more than type cumulative frequency of a class is the sum of the frequency of that class and the frequencies of all following classes. The addition is carried out from bottom to top, i.e. from the highest class to the lowest class.

Ex.1. Find cumulative frequency distribution for following data.

Class Limits 10-19 20-29 30-39 40-49 50-59 60-69 70-79

Frequencies 2 4 7 10 16 8 3

Ans:

                                 Less than C.F                  More than C.F
Class Limit  Freq  C.B           Marks           C.F            Marks         C.F
10 – 19      2     9.5 – 19.5    Less than 19.5  2              9.5 or more   48 + 2 = 50
20 – 29      4     19.5 – 29.5   Less than 29.5  2 + 4 = 6      19.5 or more  44 + 4 = 48
30 – 39      7     29.5 – 39.5   Less than 39.5  6 + 7 = 13     29.5 or more  37 + 7 = 44
40 – 49      10    39.5 – 49.5   Less than 49.5  13 + 10 = 23   39.5 or more  27 + 10 = 37
50 – 59      16    49.5 – 59.5   Less than 59.5  23 + 16 = 39   49.5 or more  11 + 16 = 27
60 – 69      8     59.5 – 69.5   Less than 69.5  39 + 8 = 47    59.5 or more  3 + 8 = 11
70 – 79      3     69.5 – 79.5   Less than 79.5  47 + 3 = 50    69.5 or more  3
Total        50
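Both types of cumulative frequency are running totals, so they can be sketched with `itertools.accumulate` from the Python standard library. The variable names below are illustrative.

```python
from itertools import accumulate

# Class frequencies from the example, lowest class first.
frequencies = [2, 4, 7, 10, 16, 8, 3]

# Less than type: add from the lowest class down to the highest class.
less_than = list(accumulate(frequencies))

# More than type: add from the highest class back up to the lowest class,
# then restore the original class order.
more_than = list(accumulate(reversed(frequencies)))[::-1]

print("Less than C.F:", less_than)
print("More than C.F:", more_than)
```

The last less than value and the first more than value both equal the total frequency, which is a quick check on the arithmetic.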

Constructing relative frequency and percentage frequency tables:

Thirty AA batteries were tested to determine how long they would last. The results, to the

nearest minute, were recorded as follows:

423, 369, 387, 411, 393, 394, 371, 377, 389, 409, 392, 408, 431, 401, 363, 391, 405, 382, 400, 381,

399, 415, 428, 422, 396, 372, 410, 419, 386, 390

An analyst studying these data might want to know not only how long batteries last, but also

what proportion of the batteries falls into each class interval of battery life.

This relative frequency of a particular observation or class interval is found by dividing the

frequency (f) by the number of observations (n): that is, (f ÷ n). Thus:

Relative frequency = frequency ÷ number of observations

The percentage frequency is found by multiplying each relative frequency value by 100.

Thus:

Percentage frequency = relative frequency × 100 = (f ÷ n) × 100

Battery life, minutes (x)  Frequency (f)  Relative frequency  Percent frequency
360-369                    2              0.07                7
370-379                    3              0.1                 10
380-389                    5              0.17                17
390-399                    7              0.23                23
400-409                    5              0.17                17
410-419                    4              0.13                13
420-429                    3              0.1                 10
430-439                    1              0.03                3
Total                      30             1                   100
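The relative and percentage frequencies in this table can be recomputed from the raw battery data. This is a minimal sketch; the rounding to two decimals and to whole percentages mirrors the table, and the variable names are illustrative.

```python
# Battery life in minutes for the 30 AA batteries, copied from the data above.
battery_life = [
    423, 369, 387, 411, 393, 394, 371, 377, 389, 409,
    392, 408, 431, 401, 363, 391, 405, 382, 400, 381,
    399, 415, 428, 422, 396, 372, 410, 419, 386, 390,
]

n = len(battery_life)                       # number of observations
rows = []
for lower in range(360, 440, 10):           # classes 360-369, ..., 430-439
    upper = lower + 9
    f = sum(lower <= x <= upper for x in battery_life)
    relative = round(f / n, 2)              # relative frequency = f / n
    percent = round(100 * f / n)            # percentage frequency = (f / n) x 100
    rows.append((f"{lower}-{upper}", f, relative, percent))

for row in rows:
    print(*row)
```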

From the above table the analyst can conclude that:

7% of the AA batteries have a life of at least 360 minutes but less than 370 minutes, and the probability of any randomly selected AA battery having a life in this range is approximately 0.07.

Representation of data (Data Graphics):

Collected statistical data is often difficult for a common person to understand. We expect that a common person should be able to pay attention to the figures and compare two or more sets of observations, which are mostly presented in reports or newspapers.

The representation of collected scientific or statistical data in a simple and attractive form, using different diagrams and graphs so that a common person can understand it, is called data graphics.

Data can be represented in two ways:

1) Diagrams

2) Graphs

1) Diagrammatic representation of data:

It is a visual form of presentation of statistical data that highlights the basic facts and relationships. Diagrams drawn on the basis of collected data are easily understood and appreciated by all. A large number of diagrams are used in biostatistical analyses. The important types of diagrams commonly used for the presentation of qualitative data are:

a) Line Diagram

b) Bar Diagram

(i) Simple bar diagram

(ii) Multiple bar diagram

(iii) Divided/Sub-divided bar diagram

(iv) Percentage bar diagram

c) Pie Diagram

a) Line Diagram:

It is the simplest type of diagram. In line diagram only lines are drawn to represent given

variables. The variable is taken along X-axis and the frequencies of the observations are taken along

Y-axis. The lines may be vertical or horizontal. The distance between lines is kept uniform. The lines

are drawn such that their length is proportional to the frequencies.

Ex.1. The table below shows Sam's weight in kilograms for 5 months.

Month Weight in kg

January 49

February 54

March 61

April 69

May 73

The data from the table above has been summarized in the line graph below.

[Figure: line graph of Sam's weight in kg (Y-axis from 0 to 80) for the months January to May]

b) Bar Diagram:

It is commonly used to represent the statistical data. Bar is a thick line. In bar diagram only

the length or height of bars is taken into consideration. The data is represented by thick bars of

uniform width keeping the uniform gaps in between two bars. The lengths or heights of bars are

taken proportional to the values they represent. Bars can be drawn vertically or horizontally.

The bar diagram is classified into four main types:

(i) Simple Bar Diagram:

It is used to represent only one variable. One bar represents one observation, so there are as many bars as the number of observations. We can use different colours or shades to identify the data and to make the diagram attractive.

Ex.1. Draw a simple bar diagram to represent the profits of a bank for 5 years.

Years                1989  1990  1991  1992  1993
Profits (million $)  10    12    18    25    42

Ans:

[Figure: simple bar diagram of profits (million $), Y-axis from 0 to 45, for the years 1989 to 1993]
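The rule that the length of each bar is proportional to the value it represents can be illustrated with a rough text-only sketch. It draws horizontal bars out of `#` characters; the helper `bar_lengths` and the maximum bar length of 42 characters are arbitrary choices made for this illustration, not part of the method described in the text.

```python
# Profits of the bank in million $, from the example above.
profits = {1989: 10, 1990: 12, 1991: 18, 1992: 25, 1993: 42}

def bar_lengths(values, max_length=42):
    """Scale values so that the largest value gets a bar of max_length characters."""
    largest = max(values)
    return [round(v / largest * max_length) for v in values]

lengths = bar_lengths(list(profits.values()))

# One bar per observation, all of the same thickness (one text line each).
for (year, profit), length in zip(profits.items(), lengths):
    print(f"{year} | {'#' * length} {profit}")
```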

Ex.2. Represent the following data by a simple bar diagram.

Years       1971  1981  1991  2001  2011
Population  45    40    50    52    47

Ans:

[Figure: simple bar diagram of population (Y-axis from 0 to 60) for the years 1971 to 2011]

(ii) Multiple Bar Diagram:

It is also known as compound bar diagram. A multiple bar diagram is used for comparing two or more variables. The number of variables may be 2, 3, 4 or more. The bars are drawn adjacent to each other as per the number of variables. In case of 2 variables, a pair of bars is drawn. In case of 3 variables, we draw triple bars. In order to distinguish the bars, they may be either differently coloured or marked with different types of crossing or dotting. An index is also prepared to identify the meaning of the different colours or dotting.

Ex.1. Draw a multiple bar diagram to represent the imports and exports of Canada (values in $) for the years 1991 to 1995.

Years  Imports  Exports
1991   7930     4260
1992   8850     5225
1993   9780     6150
1994   11720    7340
1995   12150    8145

Ans:

[Figure: multiple bar diagram of imports and exports (Y-axis from 0 to 14000) for the years 1991 to 1995]

Ex.2. Represent the following data by a multiple bar diagram.

No. of students:

College  Arts  Science  Commerce  Agriculture
A        120   800      600       400
B        750   500      300       450

Ans:

[Figure: bar diagram of number of students (Y-axis from 0 to 1000) in Arts, Science, Commerce and Agriculture, with separate bars for colleges A and B]

(iii) Divided/Sub-divided Bar Diagram:

It is also known as component bar diagram because each bar is sub-divided according to the components it consists of. The complete bar represents the total value of an observation, and its parts represent the values of the components. Each component can be distinguished from the others by a different colour.

Ex.1. The table below shows the quantity, in hundred kg, of wheat, barley and oats produced on a certain farm during the years 1991 to 1995. Draw a sub-divided bar diagram to illustrate the data.

Years  Wheat  Barley  Oats
1991   34     18      27
1992   43     14      24
1993   43     16      27
1994   45     13      34
1995   35     15      27

Ans:

[Figure: sub-divided bar diagram (Y-axis from 0 to 100) for the years 1991 to 1995, with components Wheat, Barley and Oats]

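Ex.2. Represent the data by a suitable bar diagram.

Year  Marks in Maths  Marks in Stat  Marks in Practical
2005  40              60             90
2006  35              55             85
2007  45              40             80

Ans:

[Figure: sub-divided bar diagram (Y-axis from 0 to 200) for the years 2005 to 2007, with components Marks in Maths, Marks in Stat and Marks in Practical]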

(iv) Percentage Bar Diagram:

Like the sub-divided bar diagram, in this case also the data of one observation is put on a single bar, but in terms of percentages. All the bars in this diagram are equal in height, each representing the value 100 as a percentage. The values of all components are converted into percentages of the total of their bar, and each component is depicted by its percentage within the bar.

Ex.1. The table below shows the quantity, in hundred kg, of wheat, barley and oats produced on a certain farm during the years 1991 to 1994. Draw a percentage bar diagram to illustrate the data.

Years  Wheat  Barley  Oats
1991   34     18      27
1992   43     14      24
1993   43     16      27
1994   45     13      34

Ans:

        1991             1992             1993             1994
        % value  Cum. %  % value  Cum. %  % value  Cum. %  % value  Cum. %
Wheat   43       43      53       53      50       50      49       49
Barley  23       66      17       70      19       69      14       63
Oats    34       100     30       100     31       100     37       100
Total   100              100              100              100

[Figure: percentage bar diagram (Y-axis from 0% to 100%) for the years 1991 to 1994, with components Wheat, Barley and Oats]
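The conversion of component values into percentages and cumulative percentages for a percentage bar diagram can be sketched as below. The percentages are rounded to whole numbers, as in the answer table for the wheat, barley and oats data; the variable names are illustrative.

```python
# Production in hundred kg of wheat, barley and oats for the years 1991 to 1994.
production = {
    1991: {"Wheat": 34, "Barley": 18, "Oats": 27},
    1992: {"Wheat": 43, "Barley": 14, "Oats": 24},
    1993: {"Wheat": 43, "Barley": 16, "Oats": 27},
    1994: {"Wheat": 45, "Barley": 13, "Oats": 34},
}

percentages = {}
for year, crops in production.items():
    total = sum(crops.values())             # height of the whole bar before scaling
    cumulative = 0
    percentages[year] = {}
    for crop, quantity in crops.items():
        pct = round(quantity / total * 100)  # component as a percentage of its bar
        cumulative += pct                    # where this component ends within the bar
        percentages[year][crop] = (pct, cumulative)
        print(year, crop, pct, cumulative)
```

In this data set the rounded percentages of each year add up to exactly 100. That is not guaranteed in general, so with other data a small adjustment of one component may be needed.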

c) Pie Diagram:

A pie diagram or pie chart is a circular graph in which a circle is divided into sectors. The

angle of sector is proportional to the frequency or percentage of observation. Different shades or

colours can be used to differentiate the variables.

Steps to construct pie diagram/chart:

1. Express the given values of the variables as angles (in degrees) out of the total of 360°. If the actual frequencies are given, the angle of a sector is

$$\theta = \frac{\text{actual frequency of observation}}{\text{total frequency}} \times 360$$

If the frequencies are given as percentages, then

$$\theta = \frac{\%\ \text{frequency of observation}}{100} \times 360$$

2. Draw a circle of appropriate radius with a compass.

3. Taking a radius as the base line, draw the angle of the first component at the centre of the circle using a protractor.

4. Draw the sectors for all remaining components of the given data in the same way.

5. Label the sectors and the circle graph.

6. If different shades or colours are used for the components, prepare an index.

Ex.1. Draw the pie chart for following data.

Item Agriculture Irrigation Health Education

Expenditure 4200 1500 1000 500

Ans:

Item          Expenditure   Angle ($\theta$, in degrees)
Agriculture   4200          210
Irrigation    1500          75
Health        1000          50
Education     500           25
Total         7200          360
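The angles in this table follow directly from step 1. A minimal Python sketch, with the illustrative function name `pie_angles` (not from the text), reproduces them:

```python
def pie_angles(values):
    """Convert raw values into sector angles in degrees (they sum to 360)."""
    total = sum(values)
    return [360 * v / total for v in values]


items = ["Agriculture", "Irrigation", "Health", "Education"]
expenditure = [4200, 1500, 1000, 500]

for item, angle in zip(items, pie_angles(expenditure)):
    print(f"{item}: {angle:.0f} degrees")
```

For percentage data the same function applies, because the percentages themselves sum to 100 and play the role of the total.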

[Figure: percentage bar diagram of wheat, barley and oats production for 1991 to 1994; vertical axis from 0% to 100%]


Significance of diagrammatic representation:

Diagrams are an effective technique for representing data. A layman cannot easily understand tabulated data, but a single glance at a diagram gives a complete picture of the data presented. According to M.J. Moroney, “diagrams register a meaningful impression almost before we think.”

Diagrams are useful because of the following reasons:

(i) They give very clear picture of data.

(ii) They are easy to understand in short time.

(iii) They facilitate comparison between different samples.

(iv) They help to remember data easily.

(v) They make complex data simple.

(vi) They have universal utility.

(vii) No mathematics knowledge is required to draw and understand diagrams.

Limitations of diagrammatic representation:

Diagrammatic representation has the following limitations:

(i) Diagrams do not show the small differences properly.

(ii) In statistical analysis, diagrams are of no use.

(iii) Diagrams are just a supplement to tabulation.

(iv) Diagrammatic presentation of data shows only an estimate of the actual behaviour of the variables.

(v) They can be used only for comparative studies.

2) Graphic Representation of Data:

A graph is a vivid form of presentation of data. The graphic method helps to present quantitative data in a simple, clear and attractive manner. It is the simplest and most common aid to numerical reading, giving a picture of the numbers in such a way that the variables can be easily compared.

[Figure: pie chart of expenditure on Agriculture, Irrigation, Health and Education]


The graphic method of representation of data is often more effective and powerful than diagrammatic representation. It plays an important role in comparison in all fields of study. According to A. L. Boddington, “The wandering of a line is more powerful in its effect on the mind than a tabulated statement; it shows what is happening and what is likely to take place, just as quickly as the eye is capable of working.” The presentation of statistics in the form of graphs facilitates many processes. A frequency distribution can be represented graphically in the following ways:

a) Histogram

b) Frequency polygon

c) Frequency curve

d) Ogive curve

a) Histogram:

It is one of the most important and useful methods of presenting continuous frequency

distribution. A histogram is similar to a bar diagram which shows continuous frequency

distribution of quantitative data. In this, the continuous class intervals are taken along X-axis and

the frequencies on Y-axis.

Steps to construct histogram:

1. Draw the vertical and horizontal axes using scale.

2. Take the continuous classes on X-axis and if the classes are not continuous then make them

continuous and write on X- axis.

3. Take the frequencies on Y-axis with certain multiples.

4. Draw the bar up to the required frequency for each class interval.

5. Different shades or colours can be used to decorate histogram.

Ex.1. Draw a Histogram for the following data.

Classes 8-10 10-12 12-14 14-16 16-18 18-20 20-22 22-24

Frequency 24 52 42 48 12 8 14 6


Ex.2. Draw a Histogram for the following data.

Classes 10-15 15-20 20-25 25-30 30-35

Frequency 2 6 7 5 3

b) Frequency Polygon:

A frequency polygon is a line graph derived from the histogram by joining the mid points of the tops of all bars in the histogram. It begins and ends at the base line, i.e. the X-axis.

Steps to construct frequency polygon:

1. Draw the histogram for the given data with continuous class intervals.

2. Take the class interval before the given first class and class interval after the last class with

frequency as zero and the constant width.

3. Mark the mid points of all classes at the top of the bars. Also mark the mid points of two

extra classes in step 2.

4. Join all mid points successively with straight lines.

5. The complete bounded figure is frequency polygon.

Ex.1. Draw a frequency polygon for the following data.

Monthly wages (new classes)   9-11   11-13   13-15   15-17   17-19   19-21   21-23   23-25   25-27
No. of workers                0      6       53      85      56      21      16      8       0
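The points to be joined in this example can be listed with a small Python sketch. It assumes equal class widths and takes the original classes 11-13 to 23-25 as input; the function name `polygon_points` is illustrative. The two zero-frequency classes of step 2 (9-11 and 25-27 here) are added by the code.

```python
def polygon_points(classes, freqs):
    """Return (mid point, frequency) pairs, including the two extra
    zero-frequency classes that close the polygon on the X-axis."""
    width = classes[0][1] - classes[0][0]
    first_lower = classes[0][0]
    last_upper = classes[-1][1]
    extended = (
        [(first_lower - width, first_lower)]
        + list(classes)
        + [(last_upper, last_upper + width)]
    )
    heights = [0] + list(freqs) + [0]
    return [((lo + hi) / 2, f) for (lo, hi), f in zip(extended, heights)]


wage_classes = [(11, 13), (13, 15), (15, 17), (17, 19), (19, 21), (21, 23), (23, 25)]
workers = [6, 53, 85, 56, 21, 16, 8]

for mid, freq in polygon_points(wage_classes, workers):
    print(mid, freq)
```

Joining these points in order with straight lines gives the frequency polygon.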


c) Frequency Curve:

With the help of frequency polygon and histogram, we can draw a smooth curve. It is

obtained by joining the points in frequency polygon with free hand in order to get smooth curve. It

removes the ruggedness of polygon. A smoothed frequency curve represents a generalised

characterization of the data collected from the population or mass. Like frequency polygon,

frequency curve also begins and ends at the base line.

d) Ogive curve and cumulative frequency polygon:

It is also known as Cumulative frequency curve, as it is used to represent cumulative

frequency distribution of continuous classes. As there are two types of cumulative frequencies i.e.

less than type and more than type, accordingly there are two types of ogives for any grouped

frequency distribution.

(i) Less than frequency curve (Ogive)

(ii) More than frequency curve (Ogive)

(i) Less Than Frequency Curve:

In this, cumulative frequency less than type is calculated and plotted against the upper limit

of the classes. The points so obtained are joined by a smooth curve. It is an increasing curve sloping

upward from left to right of the graph. It is in the shape of an elongated ‘S’.

(ii) More Than Frequency Curve:

In this, cumulative frequencies more than type are calculated and plotted against the lower

limit of the classes. The points so obtained are joined by a smooth curve. It is a decreasing curve

sloping downward from left to right of the graph. It is in the shape of elongated upside down ‘S’.

An interesting feature of the two ogive curves together is that their point of intersection

gives the median.

Steps for constructing an ogive:

1. Prepare the required cumulative frequency distribution table either less than or more than

or both.

2. Draw and label the X (horizontal) and the Y (vertical) axes.

3. Represent the cumulative frequencies on the Y-axis and the class limits on the X-axis.

4. Plot the cumulative frequency at each class limit with the height being the corresponding

cumulative frequency.

5. Connect the points with line segments or a smooth curve. A less than ogive always starts at zero cumulative frequency, at the lower limit of the first class.
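The plotting points for both ogives can be sketched in Python as follows. The function names are illustrative, and the data are borrowed from the second histogram example (classes 10-15 to 30-35) purely for demonstration.

```python
from itertools import accumulate


def less_than_ogive(classes, freqs):
    """Points (upper limit, cumulative frequency), starting from zero
    at the lower limit of the first class."""
    cumulative = list(accumulate(freqs))
    points = [(classes[0][0], 0)]
    points += [(hi, cf) for (_, hi), cf in zip(classes, cumulative)]
    return points


def more_than_ogive(classes, freqs):
    """Points (lower limit, number of observations at or above it),
    ending at zero at the upper limit of the last class."""
    total = sum(freqs)
    below = [0] + list(accumulate(freqs))
    points = [(lo, total - b) for (lo, _), b in zip(classes, below)]
    points.append((classes[-1][1], 0))
    return points


classes = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35)]
freqs = [2, 6, 7, 5, 3]

print(less_than_ogive(classes, freqs))
print(more_than_ogive(classes, freqs))
```

Plotting both lists on the same axes gives one rising and one falling curve; the X-coordinate of their crossing point is the median.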

Significance of graphic representation:

Graphic representation is a visual form of presentation of data. It is more effective and

result oriented than diagrammatic representation. The presentation of statistics in the form of

graphs facilitates many processes in biostatistics.


The main significances of graphs are as follows:

(i) They are more attractive and impressive than a table of figures.

(ii) They make comparison easy.

(iii) They help to present data in simple and understandable way.

(iv) Correlation between two series can be studied easily.

(v) They save time and energy of statistician as well as observer.

(vi) It needs no special knowledge of mathematics to understand graphs.

Limitations of Graphic Representation of Data:

Graphic representation of data suffers from following limitations:

(i) Graphs may be misused by taking false scales.

(ii) Accuracy is not possible in graph.

(iii) Graphs do not measure magnitude of data; they only show the fluctuations in them.

(iv) The interpretation of graphs varies from person to person.

Measures of Central Tendency

After collecting a set of statistical data, we are usually interested in making some statistical

summary statements about this large and complex set of individual values of variables. According

to Prof. Bowley, “Measures of central tendency are statistical constants which enable us to

comprehend in a single effort the significance of the whole.”

The measures of central tendency describe a distribution in terms of its most ‘frequent’, ‘typical’ or ‘average’ data value. A measure of central tendency is a summary measure that attempts to describe a complete set of data with a single value which represents the centre of the distribution. So, measures of central tendency are sometimes also called measures of central location.

The Measures of Central Tendency are used:

1) To concentrate data at a single value.

2) To facilitate comparison between data.

Criteria for an Ideal Measure of Central Tendency:

(i) It should be properly and rigidly defined.

(ii) It should be simple to understand & easy to calculate.

(iii) It should be based upon all values of given data.

(iv) It should be capable of further mathematical treatment.

(v) It should have sampling stability.

(vi) It should not be unduly affected by extreme values.

The following are the five measures of central tendency which are common in use:

1. Arithmetic Mean (AM)

2. Median

3. Mode


4. Geometric Mean (GM)

5. Harmonic Mean (HM)

1. Arithmetic Mean (AM):

The A.M. or simply mean is the most popular and well known measure of central tendency.

It is also known as ‘average’. Many statistical analyses use the mean as a standard reference point.

A.M. is defined as the sum of all observations divided by the number of observations in the data.

Calculation of Arithmetic mean:

Depending on the type of data, the arithmetic mean is calculated as follows:

a) For raw data:

If $x_1, x_2, \ldots, x_n$ are n observations then the AM is given by

$$\text{AM} = \bar{x} = \frac{\text{Sum of all observations}}{\text{Total number of observations}} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

$$\bar{x} = \frac{\sum x_i}{n}$$

Ex.1. The weights of 5 students (in kg) are 20, 21, 25, 14 and 30. Find the average of their weights.

Ans: Here, $n = 5$

$$\bar{x} = \frac{\sum x_i}{n} = \frac{20 + 21 + 25 + 14 + 30}{5} = \frac{110}{5} = 22 \text{ kg}$$

Ex.2. Arithmetic mean of 5 values 2, 4, a, 9, 5 is 6. What will be the value of ‘a’?

Ans: Here, $n = 5$ and $\bar{x} = 6$

So, $$\bar{x} = \frac{\sum x_i}{n}$$

$$6 = \frac{2 + 4 + a + 9 + 5}{5}$$

$$30 = a + 20$$

$$a = 30 - 20 = 10$$

Ex.3. The average of p and 4p is 10. Find the value of p.

Ans: Here, $n = 2$ (i.e. p and 4p) and $\bar{x} = 10$

So, $$\bar{x} = \frac{\sum x_i}{n}$$

$$10 = \frac{p + 4p}{2}$$

$$20 = 5p$$

$$p = \frac{20}{5} = 4$$

Ex.4. Arithmetic mean of 2k, 4, 7, 5, k is 20. Find value of k.

Ans: Here, $n = 5$ and $\bar{x} = 20$

So, $$\bar{x} = \frac{\sum x_i}{n}$$

$$20 = \frac{2k + 4 + 7 + 5 + k}{5}$$

$$100 = 3k + 16$$

$$84 = 3k$$

$$k = 28$$
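The raw-data examples above can be verified with a few lines of Python. The helper names `arithmetic_mean` and `missing_value` are illustrative and not part of the text; `missing_value` simply rearranges the definition of the mean to recover one unknown observation, as in Ex.2.

```python
def arithmetic_mean(observations):
    """Sum of all observations divided by their number."""
    return sum(observations) / len(observations)


def missing_value(mean, n, known):
    """Value of the single unknown observation, given the mean of all n."""
    return mean * n - sum(known)


print(arithmetic_mean([20, 21, 25, 14, 30]))  # weights in kg from Ex.1
print(missing_value(6, 5, [2, 4, 9, 5]))      # the unknown 'a' in Ex.2
```

Ex.3 and Ex.4 involve an unknown in more than one term, so they are still solved by hand as shown; the results can be checked by substituting p = 4 or k = 28 back into `arithmetic_mean`.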

b) For discrete frequency distribution:

If $x_1, x_2, \ldots, x_n$ are n observations along with the corresponding frequencies $f_1, f_2, \ldots, f_n$ then the AM is given by

$$\text{AM} = \bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n}$$

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$

The frequency distribution table in this case looks like,

Observations ($x_i$)   Frequencies ($f_i$)   $f_i x_i$
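As a rough check of the formula for a discrete frequency distribution, the following Python sketch computes the mean weighted by frequency. The function name `weighted_mean` is illustrative; the data are those of the two examples that follow.

```python
def weighted_mean(values, freqs):
    """Mean of a discrete frequency distribution: sum(f*x) / sum(f)."""
    total_fx = sum(f * x for x, f in zip(values, freqs))
    return total_fx / sum(freqs)


# Ex.1: observations 10..50 with their frequencies
print(round(weighted_mean([10, 20, 30, 40, 50], [7, 2, 5, 3, 9]), 4))

# Ex.2: age in years with number of deaths
print(round(weighted_mean([1, 2, 3, 4, 5, 6], [12, 15, 18, 10, 9, 8]), 5))
```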

Ex.1. Calculate AM for following data.

Obs    10   20   30   40   50
Freq   7    2    5    3    9

Ans: Prepare the following table.

Obs ($x_i$)   Freq ($f_i$)      $f_i x_i$
10            7                 70
20            2                 40
30            5                 150
40            3                 120
50            9                 450
Total         $\sum f_i = 26$   $\sum f_i x_i = 830$

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{830}{26} = 31.9231$$

Ex.2. Find average from the information given below.

Age in Yrs      1    2    3    4    5   6
No. of deaths   12   15   18   10   9   8

Ans: Prepare the following table.

Age in Yrs ($x_i$)   No. of deaths ($f_i$)   $f_i x_i$
1                    12                      12
2                    15                      30
3                    18                      54
4                    10                      40
5                    9                       45
6                    8                       48
Total                72                      229

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{229}{72} = 3.18056$$

c) For continuous frequency distribution:

For continuous class intervals, the AM can be calculated by the step deviation method, as follows:

1. Prepare the distribution table containing the following columns:

Class intervals   Frequencies ($f_i$)   Mid values ($X_i$)   $u_i = \frac{X_i - A}{h}$   $f_i u_i$


2. Find the mid values (X) of all classes in the third column using the formula

$$\text{class mark} = X = \frac{\text{lower limit} + \text{upper limit}}{2}$$

3. Take any mid value as the assumed mean A; for easy calculation, choose a mid value near the middle of the table.

4. Calculate the step deviation in the fourth column using $u_i = \frac{X_i - A}{h}$.

5. Multiply the frequency and step deviation columns to obtain $f_i u_i$.

6. AM is calculated by

$$\text{AM} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h$$

Where, $A$ = assumed mean taken from the class marks (X)

$f_i$ = frequencies

$h$ = class width

$u_i = \frac{X_i - A}{h}$ = step deviation

Ex.1. Calculate average marks by step deviation method from the following data.

Marks             0-10   10-20   20-30   30-40   40-50   50-60
No. of students   42     44      58      35      26      15

Ans: This is a continuous frequency distribution with continuous classes.

Prepare the following table, taking $A = 25$ and $h = 10$.

C.I.    Freq. ($f_i$)   Mid values ($X_i$)   $u_i = \frac{X_i - A}{h}$   $f_i u_i$
0-10    42              5                    -2                          -84
10-20   44              15                   -1                          -44
20-30   58              25 = A               0                           0
30-40   35              35                   1                           35
40-50   26              45                   2                           52
50-60   15              55                   3                           45
Total   220                                                              4

$$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h = 25 + \frac{4}{220} \times 10 = 25 + 0.1818 = 25.1818$$
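The step deviation method can be sketched in Python as below. This is a minimal illustration that assumes all classes have the same width; the function name `step_deviation_mean` and its parameters are not from the text.

```python
def step_deviation_mean(classes, freqs, assumed_mean):
    """Mean of a grouped distribution by the step deviation method.

    classes      -- list of (lower limit, upper limit), all of equal width
    freqs        -- frequency of each class
    assumed_mean -- the assumed mean A, normally one of the mid values
    """
    h = classes[0][1] - classes[0][0]                  # class width
    mids = [(lo + hi) / 2 for lo, hi in classes]       # class marks X_i
    u = [(x - assumed_mean) / h for x in mids]         # step deviations u_i
    sum_fu = sum(f * ui for f, ui in zip(freqs, u))    # sum of f_i * u_i
    return assumed_mean + sum_fu / sum(freqs) * h


marks = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
students = [42, 44, 58, 35, 26, 15]

print(round(step_deviation_mean(marks, students, 25), 4))
print(round(step_deviation_mean(marks, students, 35), 4))
```

Both calls print the same value: the choice of the assumed mean A only affects how convenient the hand calculation is, not the result.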


Ex.2. Calculate mean by step deviation method.

Classes   0-10   10-20   20-30   30-40   40-50   50-60   60-70
Freq.     5      10      40      30      20      10      4

Ans: This is a continuous frequency distribution with continuous classes.

Prepare the following table, taking $A = 35$ and $h = 10$.

C.I.    Freq. ($f_i$)   Mid values ($X_i$)   $u_i = \frac{X_i - A}{h}$   $f_i u_i$
0-10    5               5                    -3                          -15
10-20   10              15                   -2                          -20
20-30   40              25                   -1                          -40
30-40   30              35 = A               0                           0
40-50   20              45                   1                           20
50-60   10              55                   2                           20
60-70   4               65                   3                           12
Total   119                                                              -23

$$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h = 35 + \frac{-23}{119} \times 10 = 35 - 1.93 = 33.07$$

Ex.3. Calculate mean by step deviation method.

Classes   0-30   30-60   60-90   90-120   120-150   150-180
Freq.     8      13      22      27       18        7

Ans: This is a continuous frequency distribution with continuous classes.

Prepare the following table, taking $A = 75$ and $h = 30$.

C.I.      Freq. ($f_i$)   Mid values ($X_i$)   $u_i = \frac{X_i - A}{h}$   $f_i u_i$
0-30      8               15                   -2                          -16
30-60     13              45                   -1                          -13
60-90     22              75 = A               0                           0
90-120    27              105                  1                           27
120-150   18              135                  2                           36
150-180   7               165                  3                           21
Total     95                                                               55


$$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h = 75 + \frac{55}{95} \times 30 = 75 + 17.37 = 92.37$$

Ex.4. Calculate average marks by step deviation method.

Marks          0-10   10-20   20-30   30-40   40-50   50-60
No. of stud.   42     44      58      35      26      15

Ans: This is a continuous frequency distribution with continuous classes.

Prepare the following table, taking $A = 25$ and $h = 10$.

C.I.    Freq. ($f_i$)   Mid values ($X_i$)   $u_i = \frac{X_i - A}{h}$   $f_i u_i$
0-10    42              5                    -2                          -84
10-20   44              15                   -1                          -44
20-30   58              25 = A               0                           0
30-40   35              35                   1                           35
40-50   26              45                   2                           52
50-60   15              55                   3                           45
Total   220                                                              4

$$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h = 25 + \frac{4}{220} \times 10 = 25 + 0.18 = 25.18$$

Ex.5. Calculate mean by step deviation method.

Classes   0-10   10-20   20-30   30-40   40-50   50-60   60-70   70-80
Freq.     5      10      20      40      30      20      10      4

Ans: This is a continuous frequency distribution with continuous classes.

Prepare the following table, taking $A = 35$ and $h = 10$.

C.I.    Freq. ($f_i$)   Mid values ($X_i$)   $u_i = \frac{X_i - A}{h}$   $f_i u_i$
0-10    5               5                    -3                          -15
10-20   10              15                   -2                          -20
20-30   20              25                   -1                          -20
30-40   40              35 = A               0                           0
40-50   30              45                   1                           30
50-60   20              55                   2                           40
60-70   10              65                   3                           30
70-80   4               75                   4                           16
Total   139                                                              61


$$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h = 35 + \frac{61}{139} \times 10 = 35 + 4.39 = 39.39$$

Merits of Mean:

(i) It is rigidly defined.

(ii) It is easy to understand and easy to calculate.

(iii) It is the unique value.

(iv) It is based upon all values of the given data.

(v) It is capable of further mathematical treatment.

(vi) It is not much affected by sampling fluctuations.

Demerits of Mean:

(i) It cannot be calculated if any observations are missing.

(ii) It cannot be calculated for the data with open end classes.

(iii) It is affected by extreme values.

(iv) It cannot be located graphically.

(v) It may be a number which is not present in the data.

(vi) It cannot be calculated for data representing a qualitative characteristic.

2. Median:

The median is the value which divides the data into two equal parts. Half of the observations are above the median and half are below it. It is determined by ranking the data and locating the middle observation. It is another frequently used measure of central tendency.

Calculation of Median:

Depending on types of data, there are following methods for the calculation of median:

a) For raw data:

Steps:

1. Arrange the given data in ascending order.

2. If the number of observations (n) is odd then the median is the exact central value,

i.e. $$\text{Median} = \text{observation at position } \frac{n+1}{2}$$

3. If the number of observations is even then there are two central values, say $x_1$ and $x_2$, such that $x_1$ is the $\left(\frac{n}{2}\right)$th observation and $x_2$ is the $\left(\frac{n}{2} + 1\right)$th observation. Hence the median is the average of these two central values,

i.e. $$\text{Median} = \frac{x_1 + x_2}{2}$$


Ex.1. Find median: 61, 63, 60, 64, 65, 62, 63, 69, 68.

Ans: Arrange data in ascending order: 60, 61, 62, 63, 63, 64, 65, 68, 69

Here, $n = 9$ (odd no.)

Hence, the median is the observation at position $\frac{n+1}{2} = \frac{9+1}{2} = 5$

Median = 5th observation = 63

Ex.2. Find median: 30, 60, 28, 35, 46, 47, 63, 64, 62, 32

Ans: Ascending order: 28, 30, 32, 35, 46, 47, 60, 62, 63, 64

Here, $n = 10$ (even no.)

Hence, the central observations are those at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$, i.e. at positions 5 and 6.

$$\text{Median} = \frac{\text{5th obs} + \text{6th obs}}{2} = \frac{46 + 47}{2} = 46.5$$
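The steps for raw data translate directly into code. The sketch below uses the illustrative name `median_raw`; Python lists are indexed from 0, so position k in the text corresponds to index k - 1.

```python
def median_raw(observations):
    """Median of raw data following the odd/even rule in the text."""
    ordered = sorted(observations)          # step 1: ascending order
    n = len(ordered)
    if n % 2 == 1:                          # odd n: the (n+1)/2-th observation
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the (n/2)-th and (n/2 + 1)-th observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(median_raw([61, 63, 60, 64, 65, 62, 63, 69, 68]))      # Ex.1
print(median_raw([30, 60, 28, 35, 46, 47, 63, 64, 62, 32]))  # Ex.2
```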

b) For discrete frequency distribution:

Steps:

1. Prepare the cumulative frequency (C.F.) less than type table.

2. Find the value of $\frac{N}{2}$, where $N = \sum f_i$.

3. Find the C.F. just greater than $\frac{N}{2}$.

4. The observation corresponding to the C.F. just greater than $\frac{N}{2}$ is the median.

The frequency distribution table in this case looks like,

Observations ($x_i$)   Frequencies ($f_i$)   C.F.

Ex.1. Find median for following.

Obs 1 2 3 4 5 6 7 8 9

Freq 8 10 11 16 20 25 15 9 6

Ans: Prepare a table to find the less than type cumulative frequency.

$x_i$   $f_i$   CF
1       8       8
2       10      18
3       11      29
4       16      45
5       20      65
6       25      90
7       15      105
8       9       114
9       6       120
Total   120

$$\frac{N}{2} = \frac{120}{2} = 60$$

CF just greater than 60 = 65

Hence, the median is the observation corresponding to CF 65.

Median = 5

Ex.2. Find median for following.

Obs 5 10 15 20 25 30 35

Freq 1 3 13 17 27 36 38

Ans: Prepare a table to find the less than type cumulative frequency.

$x_i$   $f_i$   CF
5       1       1
10      3       4
15      13      17
20      17      34
25      27      61
30      36      97
35      38      135
Total   135

$$\frac{N}{2} = \frac{135}{2} = 67.5$$

CF just greater than 67.5 = 97

Hence, the median is the observation corresponding to CF 97.

Median = 30
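The four steps for a discrete frequency distribution can be sketched in Python. The function name `median_discrete` is illustrative, and the observations are assumed to be listed in ascending order, as in the tables above.

```python
from itertools import accumulate


def median_discrete(values, freqs):
    """Observation whose less than type C.F. is the first to exceed N/2."""
    half = sum(freqs) / 2
    for x, cf in zip(values, accumulate(freqs)):
        if cf > half:        # "C.F. just greater than N/2"
            return x


print(median_discrete([1, 2, 3, 4, 5, 6, 7, 8, 9],
                      [8, 10, 11, 16, 20, 25, 15, 9, 6]))   # Ex.1
print(median_discrete([5, 10, 15, 20, 25, 30, 35],
                      [1, 3, 13, 17, 27, 36, 38]))          # Ex.2
```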

c) For continuous frequency distribution:

Steps:

1. Prepare the cumulative frequency (C.F.) less than type table.

2. Find the value of $\frac{N}{2}$, where $N = \sum f_i$.

3. Find the C.F. just greater than $\frac{N}{2}$. The class interval corresponding to this C.F. is the median class.

4. Find the C.F. of the pre-median class and denote it by ‘c’.

5. Hence the median is given by

$$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right)$$

Where, $L$ = lower limit of the median class

$h$ = width of the median class

$f$ = frequency of the median class

$c$ = C.F. of the pre-median class


The frequency distribution table in this case looks like,

Class intervals   Frequencies ($f_i$)   C.F.

Ex.1. Find median.

Class   20-30   30-40   40-50   50-60   60-70
Freq    14      23      27      21      15

Ans:

Classes   $f_i$   CF
20-30     14      14
30-40     23      37
40-50     27      64
50-60     21      85
60-70     15      100
Total     100

$$\frac{N}{2} = \frac{100}{2} = 50$$

CF just greater than 50 = 64

Hence, Median class = 40-50

Here, $L = 40$, $h = 10$, $f = 27$, $c = 37$

$$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 40 + \frac{10}{27}(50 - 37) = 40 + 4.815 = 44.815$$

Ex.2. Calculate median.

Class   20-25   25-30   30-35   35-40   40-45   45-50
Freq    100     140     200     320     300     240


Ans: Prepare the following frequency distribution table:

Classes   $f_i$   CF
20-25     100     100
25-30     140     240
30-35     200     440
35-40     320     760
40-45     300     1060
45-50     240     1300
Total     1300

$$\frac{N}{2} = \frac{1300}{2} = 650$$

CF just greater than 650 = 760

Hence, Median class = 35-40

Here, $L = 35$, $h = 5$, $f = 320$, $c = 440$

$$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 35 + \frac{5}{320}(650 - 440) = 35 + 3.28 = 38.28$$

Ex.3. Find median from following data.

Class   0-10   10-20   20-30   30-40   40-50
Freq    8      15      22      15      8

Ans:

Classes   $f_i$   CF
0-10      8       8
10-20     15      23
20-30     22      45
30-40     15      60
40-50     8       68
Total     68

$$\frac{N}{2} = \frac{68}{2} = 34$$

CF just greater than 34 = 45

Hence, Median class = 20-30

Here, $L = 20$, $h = 10$, $f = 22$, $c = 23$

$$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 20 + \frac{10}{22}(34 - 23) = 20 + 5 = 25$$
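The median formula for a continuous frequency distribution can be sketched in Python as follows. The function name `median_grouped` is illustrative; the running total `c` plays the role of the C.F. of the pre-median class.

```python
def median_grouped(classes, freqs):
    """Median = L + (h / f) * (N/2 - c) for a grouped distribution."""
    half = sum(freqs) / 2
    c = 0                                    # C.F. of the classes seen so far
    for (lower, upper), f in zip(classes, freqs):
        if c + f > half:                     # first C.F. just greater than N/2
            h = upper - lower
            return lower + h / f * (half - c)
        c += f


classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
print(round(median_grouped(classes, [14, 23, 27, 21, 15]), 2))   # Ex.1

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
print(round(median_grouped(classes, [8, 15, 22, 15, 8]), 2))     # Ex.3
```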

Merits of Median:

(i) It is rigidly defined.

(ii) It is easy to understand and easy to calculate.

(iii) It is not affected by extreme values.

(iv) Even if extreme values are not known median can be calculated.

(v) It can be located just by inspection in many cases.

(vi) It is the unique value.

(vii) It is not much affected by sampling fluctuations.

(viii) It can be calculated for data based on ordinal scale i.e. on ordering.

Demerits of Median:

(i) It is not based upon all values of the given data.

(ii) For large data sets, arranging the data in increasing order is a difficult process.

(iii) It is not capable of further mathematical treatment.

(iv) It is insensitive to some changes in the data values.

3. Mode:

The mode is the value that occurs most frequently in a set of observations. It is an

observation which repeats maximum number of times. The mean and median require a calculation

but the mode is found simply by counting the number of times each value occurs in a data set.

Sometimes we may come across a distribution having more than one mode. If there are two modes

then it is bimodal distribution. Likewise if there are more modes then it is multi-modal distribution.

Calculation of Mode:

a) For raw data:

An observation repeating maximum number of times is mode.


Ex.1. Find mode for following data: 61, 62, 63, 61, 63, 64, 64, 64, 60, 65.

Ans: 𝑀𝑜𝑑𝑒 = 𝑜𝑏𝑠 𝑟𝑒𝑝𝑒𝑎𝑡𝑖𝑛𝑔 𝑚𝑎𝑥𝑖𝑚𝑢𝑚 𝑡𝑖𝑚𝑒𝑠 = 64

b) For discrete frequency distribution:

An observation corresponding to the highest frequency in the table is mode.

Ex.1. Find mode.

Size 5 10 15 20 25 30 35

Freq 1 3 13 36 27 17 5

Ans: 𝑀𝑜𝑑𝑒 = 𝑜𝑏𝑠 𝑤𝑖𝑡𝑕 𝑕𝑖𝑔𝑕𝑒𝑠𝑡 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦(36) = 20

c) For continuous frequency distribution:

Steps:

1. Find the maximum frequency in the table denoted by 𝑓𝑚 .

2. The class interval corresponding to 𝑓𝑚 is Modal class.

3. Find frequency of Pre-modal class ( 𝑓1 ) and frequency of Post-modal class ( 𝑓2 ).

4. Hence the mode is given by

$$\text{Mode} = L + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times h$$

Where, $L$ = lower limit of the modal class

$f_m$ = maximum frequency

$f_1$ = frequency of the pre-modal class

$f_2$ = frequency of the post-modal class

$h$ = width of the modal class.

Ex.1. Calculate mode.

Classes 20-30 30-40 40-50 50-60 60-70 70-80 80-90

Freq 28 32 45 60 56 40 20

Ans:

Classes   20-30   30-40   40-50        50-60        60-70        70-80   80-90
Freq      28      32      45 ($f_1$)   60 ($f_m$)   56 ($f_2$)   40      20

Here, maximum frequency = $f_m = 60$

Modal class = 50-60

Hence, $L = 50$, $h = 10$, $f_1 = 45$, $f_2 = 56$, $f_m = 60$

$$\text{Mode} = L + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times h = 50 + \frac{60 - 45}{120 - 45 - 56} \times 10 = 50 + 7.89 = 57.89$$

Ex.2. Determine mode.

Classes   0-100   100-200   200-300   300-400   400-500
Freq      12      18        27        20        17

Ans:

Classes   0-100   100-200      200-300      300-400      400-500
Freq      12      18 ($f_1$)   27 ($f_m$)   20 ($f_2$)   17

Here, maximum frequency = $f_m = 27$

Modal class = 200-300

Hence, $L = 200$, $h = 100$, $f_1 = 18$, $f_2 = 20$, $f_m = 27$

$$\text{Mode} = L + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times h = 200 + \frac{27 - 18}{54 - 18 - 20} \times 100 = 200 + 56.25 = 256.25$$
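The mode formula for a continuous frequency distribution can be sketched in Python. The function name `mode_grouped` is illustrative; as an assumption not stated in the text, a missing neighbour (when the modal class is the first or last class) is treated as having frequency zero, and the first class is used if several share the maximum frequency.

```python
def mode_grouped(classes, freqs):
    """Mode = L + (fm - f1) / (2*fm - f1 - f2) * h for a grouped distribution."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])   # index of modal class
    fm = freqs[m]
    f1 = freqs[m - 1] if m > 0 else 0                    # pre-modal frequency
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0       # post-modal frequency
    lower, upper = classes[m]
    return lower + (fm - f1) / (2 * fm - f1 - f2) * (upper - lower)


classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90)]
freqs = [28, 32, 45, 60, 56, 40, 20]
print(round(mode_grouped(classes, freqs), 2))   # data of Ex.1
```

Note that the result always lies inside the modal class, between L and L + h, which is a quick sanity check on any hand calculation.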

Measures of dispersion:

We have learnt about the various measures of central tendency. Measures of central tendency give us an idea of the concentration of the observations about the central part of the data, but they cannot describe the distribution completely. If we know only the average or mean of a certain distribution, we cannot form a complete idea about its observations, because different sets of observations may have the same arithmetic mean and yet vary differently about it. A measure of central tendency is a single value that represents a characteristic such as age or height of a group of persons, while a measure of dispersion quantifies how much persons in the group vary from each other and from the measure of central tendency. So can the central tendency alone describe the data fully or adequately?

To understand it, consider the following example.


The daily income of the workers in two factories is:

Factory A 35 45 50 65 70 90 100

Factory B 60 65 65 65 65 65 70

Here, in both the factories the mean of the data is the same, i.e. 65; but

(i) In Factory A, the observations are much more scattered from the mean.

(ii) In Factory B, almost all the observations are concentrated around the mean.

Thus, the two groups differ even though they have the same mean, and hence we need to differentiate between them. We need some measures which can measure the degree of scatteredness.
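The comparison can be reproduced numerically. The short Python sketch below (variable and function names are illustrative) shows that the two factories share the same mean while the spread between the highest and lowest income differs widely.

```python
factory_a = [35, 45, 50, 65, 70, 90, 100]
factory_b = [60, 65, 65, 65, 65, 65, 70]


def mean(data):
    """Arithmetic mean of raw data."""
    return sum(data) / len(data)


def spread(data):
    """Difference between the largest and smallest observation."""
    return max(data) - min(data)


for name, incomes in [("Factory A", factory_a), ("Factory B", factory_b)]:
    print(name, "mean =", mean(incomes), "spread =", spread(incomes))
```

The spread used here is the simplest possible measure of scatter; the measures of dispersion discussed next refine this idea.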

Dispersion:

Scattering of data is also known as dispersion. According to W.I. King, “the term dispersion is used to indicate the fact that within a given group, the items differ from one another in size or in other words, there is lack of uniformity in their sizes.” Spiegel defines it as, “the degree to which numerical data tends to spread about an average value is called variation or dispersion of the data.” Similarly, in the words of A.L. Bowley, “Dispersion is a measure of variation of the items.”

It is clear from these definitions that the deviation or variation of each observation from the central value (i.e. mean, median or mode) is called dispersion or scattering of data. Dispersion is defined as the degree of variation or deviation of each observation from the central value of the distribution. The single value which describes the variability or scattering of observations about the central value is called a Measure of Dispersion. The measures of dispersion give the extent to which the observations vary from the average of the data. They help in studying the important characteristics of the data.

If $x_1, x_2, \ldots, x_n$ are the observations in the given data and A is any measure of central tendency, i.e. mean/median/mode, then the deviation or dispersion is given by

$$\text{Deviation} = x_i - A$$

Criteria for an Ideal Measure of Dispersion:

(i) It should be properly defined.

(ii) It should be easy to understand and easy to calculate.

(iii) It should be based on all the observations.

(iv) It should not be affected by the sampling fluctuations.

The following are the important measures of dispersion:

1. Range

2. Mean Deviation (MD)

3. Variance and Standard Deviation (SD)

4. Quartile Deviation (QD)

At the same time, we will calculate relative measures of dispersions as:


i. Coefficient of Range

ii. Coefficient of Mean Deviation

iii. Coefficient of Variation

iv. Coefficient of Quartile Deviation

1. Range:

Range is the quickest and simplest measure of dispersion. It takes into account only the highest and the lowest observations in the data. For a given set of data, range is defined as the difference between the highest (maximum) and lowest (minimum) observation. The range is often reported as “from (the minimum) to (the maximum),” i.e., two numbers.

Coefficient of Range: is given by

$$\text{Coefficient of Range} = \frac{L - S}{L + S} \times 100$$

Where, L = largest value of data

S = smallest value of data

a) For raw data and discrete frequency distribution:

Range is the difference between the highest value and the lowest value of the data. If L is the

largest (highest) value and S is the smallest (lowest) value of the observations in the data then

𝑅𝑎𝑛𝑔𝑒 = 𝐿 − 𝑆

Ex.1. The marks obtained by 10 students in Mathematics are given below. Find the range.

15, 16, 16, 29, 11, 23, 35, 25, 19, 20

Ans: Here, Largest value = 𝐿 = 35

Smallest value = 𝑆 = 11

𝑅𝑎𝑛𝑔𝑒 = 𝐿 − 𝑆 = 35 – 11 = 24.

Ex.2. Calculate range and coefficient of range for following data:

Marks 10 15 20 25 30

No. Of students 7 8 13 12 10

Ans: Here, Largest value = $L = 30$

Smallest value = $S = 10$

$$\text{Range} = L - S = 30 - 10 = 20$$

$$\text{Coefficient of Range} = \frac{L - S}{L + S} \times 100 = \frac{30 - 10}{30 + 10} \times 100 = \frac{20}{40} \times 100 = 50\%$$


b) For continuous frequency distribution table:

In case of continuous frequency distribution, range is the difference between upper limit of

highest class interval and lower limit of lowest class interval. It is also calculated as the difference

between the mid values of the highest and lowest class intervals.

Ex.1. Calculate range and coefficient of range:

Weights in kg. 50 – 55 55 - 60 60 - 65 65 - 70 70 - 75

No. Of students 12 18 23 10 3

Ans: Here, Highest class interval: 70-75, so Largest value = $L = 75$

Lowest class interval: 50-55, so Smallest value = $S = 50$

$$\text{Range} = L - S = 75 - 50 = 25$$

$$\text{Coefficient of Range} = \frac{L - S}{L + S} \times 100 = \frac{75 - 50}{75 + 50} \times 100 = \frac{25}{125} \times 100 = 20\%$$
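Both quantities depend only on the largest and smallest values, so a minimal Python sketch is very short. The function name `range_and_coefficient` is illustrative; for a continuous distribution the upper limit of the highest class and the lower limit of the lowest class are passed in, as in the example above.

```python
def range_and_coefficient(largest, smallest):
    """Return (range, coefficient of range in percent)."""
    data_range = largest - smallest
    coefficient = 100 * data_range / (largest + smallest)
    return data_range, coefficient


marks = [10, 15, 20, 25, 30]
print(range_and_coefficient(max(marks), min(marks)))   # discrete example
print(range_and_coefficient(75, 50))                   # weights 50-55 ... 70-75
```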

Merits of Range:

(i) It is rigidly defined.

(ii) It gives rough but quick answer.

(iii) It is simple to understand and easy to calculate. It can be found by mere inspection.

(iv) It can be calculated from extreme values only. So we need not know the details of the series

to calculate the range.

Demerits of range:

(i) It is not representative since it is not based on all the observations of the series.

(ii) It is not capable of further algebraic treatment.

(iii) In case of open-end classes range cannot be determined exactly.

(iv) It is not a stable measure of dispersion and is very much affected by the fluctuations of

sampling.

2. Mean Deviation (MD):

Mean deviation is also known as average deviation. Mean deviation about any central value A, i.e. MD(A), is defined as the arithmetic mean of the absolute deviations of all observations from that measure of central tendency. While calculating the mean deviation, the algebraic signs (+ or -) of the deviations are ignored and the deviations are taken as positive using the modulus $|\ |$.


Coefficient of Mean Deviation: It is common to all three of the following types of data and is given by

$$\text{Coeff. of } MD(A) = \frac{MD(A)}{A} \times 100, \quad \text{where } A = \text{mean / median / mode}$$

a) For raw data:

Steps: If $x_1, x_2, \ldots, x_n$ are n observations then

1. Find the required measure of central tendency asked in the example.

2. Prepare a table of two columns and calculate the positive deviations in the second column, as shown below.

$x_i$   $|x_i - A|$

3. Find the total of the deviations.

4. Then MD is calculated by

$$MD(A) = \frac{\sum |x_i - A|}{n}, \quad \text{where } A = \text{mean / median / mode}$$

Ex.1. Calculate M.D. about median for: 1, 2, 3, 4, 5, 6, 7, 8, 9

Ans: Ascending order: 1, 2, 3, 4, 5, 6, 7, 8, 9

$n = 9$

Median = $A = 5$

$x_i$   $|x_i - 5|$
1       4
2       3
3       2
4       1
5       0
6       1
7       2
8       3
9       4
Total   20

$$MD(5) = \frac{\sum |x_i - 5|}{9} = \frac{20}{9} = 2.222$$
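The calculation for raw data can be sketched in Python as follows. The function name `mean_deviation` is illustrative; the central value (mean, median or mode) is found separately and passed in.

```python
def mean_deviation(observations, centre):
    """Average of the absolute deviations |x - A| about a central value A."""
    return sum(abs(x - centre) for x in observations) / len(observations)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(round(mean_deviation(data, 5), 2))   # about the median, 20/9
```

The coefficient of mean deviation is then obtained by dividing the result by the central value and multiplying by 100.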


Ex.2. Find mean deviation about mean and mode: 2, 5, 7, 8, 7, 6, 12, 3

Ans: Mean fore given data,

𝑥 = 𝑥𝑖𝑛

=50

8= 6.25 ≅ 6

Mode for given data is, 𝑚𝑜𝑑𝑒 = 7

𝑀𝐷 6 = 𝑥𝑖−6

8=

18

8= 2.25

𝑀𝐷 7 = 𝑥𝑖− 7

8=

18

8= 2.25

b) For discrete frequency distribution:

Steps: If 𝑥1 ,𝑥2 ,…𝑥𝑛 are n observations along with the corresponding frequencies 𝑓1 ,𝑓2,…𝑓𝑛 then

1. Find the required measure of central tendency asked in example.

2. Prepare the frequency distribution table of four columns and calculate positive

deviations and their products with corresponding frequencies as shown below.

3. Find the total of frequency column and last column (𝑓𝑖 𝑥𝑖 − 𝐴 )

4. Then MD is calculated by

𝑀𝐷(𝐴) = 𝑓𝑖 𝑥𝑖− 𝐴

𝑓𝑖 Where, 𝐴 = 𝑚𝑒𝑎𝑛 / 𝑚𝑒𝑑𝑖𝑎𝑛 / 𝑚𝑜𝑑𝑒

Ex.1. Calculate M.D. from mean.

𝑥𝑖 10 11 12 13 14

𝑓𝑖 2 5 7 3 1

𝑥𝑖 𝑥𝑖 − 6 𝑥𝑖 − 7

2

5

7

8

7

6

12

3

4

1

1

2

2

0

6

3

5

2

0

1

0

1

5

4

Total 18 18

𝑥𝑖 𝑓𝑖 𝑥𝑖 − 𝐴 𝑓𝑖 𝑥𝑖 − 𝐴

Page 48: Applied Biostatistics - bhumipublishing.com

Bhumi Publishing, India

40

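Ans:

| $x_i$ | $f_i$ | $x_i \cdot f_i$ | $\lvert x_i - 12 \rvert$ | $f_i \lvert x_i - 12 \rvert$ |
|---|---|---|---|---|
| 10 | 2 | 20 | 2 | 4 |
| 11 | 5 | 55 | 1 | 5 |
| 12 | 7 | 84 | 0 | 0 |
| 13 | 3 | 39 | 1 | 3 |
| 14 | 1 | 14 | 2 | 2 |
| Total | 18 | 212 | | 14 |

$$A = \bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{212}{18} = 11.78 \cong 12$$

$$MD(12) = \frac{\sum f_i \lvert x_i - 12 \rvert}{\sum f_i} = \frac{14}{18} = 0.78$$

Ex.2. Calculate M.D. from median.

𝑥𝑖 1 2 3 4 5

𝑓𝑖 8 10 15 5 2

Ans:

| $x_i$ | $f_i$ | c.f. | $\lvert x_i - 3 \rvert$ | $f_i \lvert x_i - 3 \rvert$ |
|---|---|---|---|---|
| 1 | 8 | 8 | 2 | 16 |
| 2 | 10 | 18 | 1 | 10 |
| 3 | 15 | 33 | 0 | 0 |
| 4 | 5 | 38 | 1 | 5 |
| 5 | 2 | 40 | 2 | 4 |
| Total | 40 | | | 35 |

$$\frac{N}{2} = \frac{\sum f_i}{2} = \frac{40}{2} = 20$$

c.f. just greater than 20 = 33

$A = \text{median} = 3$

$$MD(3) = \frac{\sum f_i \lvert x_i - 3 \rvert}{\sum f_i} = \frac{35}{40} = 0.875$$

Ex.3. Calculate M.D. from mode.

𝑥𝑖 10 12 15 17 19

𝑓𝑖 2 4 10 5 1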

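Ans:

$A = \text{mode} = 15$

| $x_i$ | $f_i$ | $\lvert x_i - 15 \rvert$ | $f_i \lvert x_i - 15 \rvert$ |
|---|---|---|---|
| 10 | 2 | 5 | 10 |
| 12 | 4 | 3 | 12 |
| 15 | 10 | 0 | 0 |
| 17 | 5 | 2 | 10 |
| 19 | 1 | 4 | 4 |
| Total | 22 | | 36 |

$$MD(15) = \frac{\sum f_i \lvert x_i - 15 \rvert}{\sum f_i} = \frac{36}{22} = 1.64$$

c) For continuous frequency distribution:

Steps: For mid-values $X_1, X_2, \ldots, X_n$ of the given class intervals with the corresponding frequencies $f_1, f_2, \ldots, f_n$

1. Find the required measure of central tendency asked in example.

2. Prepare the frequency distribution table of five columns. Write mid values ($X_i$) of classes in third column then calculate positive deviations with mid values and their products with corresponding frequencies as shown below.

| Classes | $f_i$ | Mid values ($X_i$) | $\lvert X_i - A \rvert$ | $f_i \lvert X_i - A \rvert$ |
|---|---|---|---|---|

3. Find the total of frequency column and last column ($f_i \lvert X_i - A \rvert$).

4. Then MD is calculated by

$$MD(A) = \frac{\sum f_i \lvert X_i - A \rvert}{\sum f_i}, \quad \text{where } A = \text{mean / median / mode}$$

Ex.1. Calculate M.D. from mean for following data.

Classes 0-10 10-20 20-30 30-40 40-50

Freq 5 8 15 16 6

Ans:

| Classes | $f_i$ | Mid values ($X_i$) | $f_i \cdot X_i$ | $\lvert X_i - 27 \rvert$ | $f_i \lvert X_i - 27 \rvert$ |
|---|---|---|---|---|---|
| 0-10 | 5 | 5 | 25 | 22 | 110 |
| 10-20 | 8 | 15 | 120 | 12 | 96 |
| 20-30 | 15 | 25 | 375 | 2 | 30 |
| 30-40 | 16 | 35 | 560 | 8 | 128 |
| 40-50 | 6 | 45 | 270 | 18 | 108 |
| Total | 50 | | 1350 | | 472 |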

$$\bar{x} = \frac{\sum f_i X_i}{\sum f_i} = \frac{1350}{50} = 27$$

Hence,

$$MD(27) = \frac{\sum f_i \lvert X_i - 27 \rvert}{\sum f_i} = \frac{472}{50} = 9.44$$
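The frequency-weighted formula is the same for discrete values and for class mid-values, so one small routine covers both cases. The sketch below uses hypothetical helper names (`grouped_mean`, `grouped_mean_deviation`) and the data of Ex.1 for continuous data (classes 0-10 to 40-50).

```python
def grouped_mean(values, freqs):
    """Mean of a frequency table: sum(f * x) / sum(f)."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)

def grouped_mean_deviation(values, freqs, center):
    """MD(A) = sum(f * |x - A|) / sum(f) for values (or mid-values) x."""
    total = sum(f * abs(x - center) for x, f in zip(values, freqs))
    return total / sum(freqs)

mid_values = [5, 15, 25, 35, 45]       # mid-values of classes 0-10 ... 40-50
freqs = [5, 8, 15, 16, 6]

mean = grouped_mean(mid_values, freqs)                     # 1350 / 50
md = grouped_mean_deviation(mid_values, freqs, mean)       # 472 / 50
print(mean, round(md, 2))
```

For a discrete distribution, pass the observed values themselves in place of `mid_values`.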

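Ex.2. Calculate M.D. and coefficient of M.D. about median for following data.

Classes 0-10 10-20 20-30 30-40 40-50

Freq 8 15 22 15 8

Ans:

| Classes | $f_i$ | C.F. | Mid values ($X_i$) | $\lvert X_i - 25 \rvert$ | $f_i \lvert X_i - 25 \rvert$ |
|---|---|---|---|---|---|
| 0-10 | 8 | 8 | 5 | 20 | 160 |
| 10-20 | 15 | 23 | 15 | 10 | 150 |
| 20-30 | 22 | 45 | 25 | 0 | 0 |
| 30-40 | 15 | 60 | 35 | 10 | 150 |
| 40-50 | 8 | 68 | 45 | 20 | 160 |
| Total | 68 | | | | 620 |

$$\frac{N}{2} = \frac{68}{2} = 34$$

C.F. just greater than 34 = 45

Median class = 20-30

$L = 20, \quad h = 10, \quad f = 22, \quad c = 23$

$$\text{Median} = A = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 20 + \frac{10}{22}(34 - 23) = 20 + 5 = 25$$

Hence,

$$MD(25) = \frac{\sum f_i \lvert X_i - 25 \rvert}{\sum f_i} = \frac{620}{68} = 9.12$$

$$\text{Coeff. of } MD(A) = \frac{MD(A)}{A} \times 100 = \frac{9.12}{25} \times 100 = 36.48\%$$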
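The median-class lookup and the interpolation formula $L + \frac{h}{f}\left(\frac{N}{2} - c\right)$ can be sketched as follows. The function name `grouped_median` and the assumption of equal class widths are illustrative choices made for this sketch; the data are those of Ex.2.

```python
def grouped_median(lower_limits, width, freqs):
    """Median of a continuous frequency table with equal class widths."""
    half = sum(freqs) / 2            # N / 2
    cum = 0                          # c: cumulative frequency before the class
    for L, f in zip(lower_limits, freqs):
        if cum + f > half:           # first class whose c.f. exceeds N / 2
            return L + width / f * (half - cum)
        cum += f
    raise ValueError("frequency table is empty")

lower_limits = [0, 10, 20, 30, 40]
freqs = [8, 15, 22, 15, 8]
A = grouped_median(lower_limits, 10, freqs)          # median class 20-30

mid_values = [L + 5 for L in lower_limits]           # class mid-values
md = sum(f * abs(x - A) for x, f in zip(mid_values, freqs)) / sum(freqs)
print(round(A, 2), round(md, 2), round(md / A * 100, 2))
```

The printed coefficient can differ in the last decimal from a hand calculation that rounds M.D. before dividing.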


Ex.3. Calculate mean deviation (M.D.) about mean.

Classes 11-13 13-15 15-17 17-19 19-21 21-23 23-25

Freq 6 53 85 56 21 16 8


Ex.4. Calculate M.D. from median.

C.I. 25-30 30-35 35-40 40-45 45-50 50-55 55-60 60-65 65-70

Freq 6 12 17 30 10 10 8 5 2


Ans:

| Classes | $f_i$ | C.F. | Mid values ($X_i$) | $\lvert X_i - 42.5 \rvert$ | $f_i \lvert X_i - 42.5 \rvert$ |
|---|---|---|---|---|---|
| 25-30 | 6 | 6 | 27.5 | 15 | 90 |
| 30-35 | 12 | 18 | 32.5 | 10 | 120 |
| 35-40 | 17 | 35 | 37.5 | 5 | 85 |
| 40-45 | 30 | 65 | 42.5 | 0 | 0 |
| 45-50 | 10 | 75 | 47.5 | 5 | 50 |
| 50-55 | 10 | 85 | 52.5 | 10 | 100 |
| 55-60 | 8 | 93 | 57.5 | 15 | 120 |
| 60-65 | 5 | 98 | 62.5 | 20 | 100 |
| 65-70 | 2 | 100 | 67.5 | 25 | 50 |
| Total | 100 | | | | 715 |

$$\frac{N}{2} = \frac{100}{2} = 50$$

C.F. just greater than 50 = 65

Median class = 40-45

$L = 40, \quad h = 5, \quad f = 30, \quad c = 35$

$$\text{Median} = A = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 40 + \frac{5}{30}(50 - 35) = 40 + 2.5 = 42.5$$

Hence,

$$MD(42.5) = \frac{\sum f_i \lvert X_i - 42.5 \rvert}{\sum f_i} = \frac{715}{100} = 7.15$$

Ex.5. Calculate M.D. from median.

C.I. 20-25 25-30 30-35 35-40 40-45 45-50

Freq 10 14 20 36 30 24

Ans:

| Classes | $f_i$ | C.F. | Mid values ($X_i$) | $\lvert X_i - 38.5 \rvert$ | $f_i \lvert X_i - 38.5 \rvert$ |
|---|---|---|---|---|---|
| 20-25 | 10 | 10 | 22.5 | 16 | 160 |
| 25-30 | 14 | 24 | 27.5 | 11 | 154 |
| 30-35 | 20 | 44 | 32.5 | 6 | 120 |
| 35-40 | 36 | 80 | 37.5 | 1 | 36 |
| 40-45 | 30 | 110 | 42.5 | 4 | 120 |
| 45-50 | 24 | 134 | 47.5 | 9 | 216 |
| Total | 134 | | | | 806 |

$$\frac{N}{2} = \frac{134}{2} = 67$$

C.F. just greater than 67 = 80

Median class = 35-40

$L = 35, \quad h = 5, \quad f = 36, \quad c = 44$

$$\text{Median} = A = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 35 + \frac{5}{36}(67 - 44) = 35 + 3.19 = 38.19 \cong 38.5$$

Hence,

$$MD(38.5) = \frac{\sum f_i \lvert X_i - 38.5 \rvert}{\sum f_i} = \frac{806}{134} = 6.015$$

Merits of Mean Deviation:

(i) It is simple to understand and easy to calculate.

(ii) It is based on all the observations.

(iii) It shows the dispersion or scatter of the various items of a series from any measure of central tendency.

(iv) It is not very much affected by the values of extreme items of a series.

(v) It facilitates comparison between different items of a series.

(vi) It truly represents the average of deviations of the items.

Demerits of Mean Deviation:

(i) It is not rigidly defined, as it can be calculated from any central value, viz. mean, median or mode, and hence it can produce different results.

(ii) It violates the algebraic principle by ignoring the + and – signs while calculating the deviations of the different items from the central value.

(iii) It is not capable of further algebraic treatment.

(iv) It is affected much by the fluctuations in sampling.

(v) It is difficult to calculate when the actual value of the average is a fraction or a recurring figure, because an approximate value must then be used.

3. Variance or Standard Deviation (SD):

Among all measures of dispersion Standard Deviation (or variance) is considered superior

because it possesses almost all the requisite characteristics of a good measure of dispersion.

Variance helps us in isolating the effects of various factors. To calculate variance, the deviations of

observations are squared and added. This addition is then divided by the total number of

observations. Thus, variance is defined as the arithmetic mean of squares of deviations taken from

the mean of given observations.

The positive square root of the variance is called the standard deviation i.e. the positive square root of arithmetic mean of squares of deviations about mean is known as standard deviation.

$$\text{Variance} = SD^2 \quad \text{i.e.} \quad SD = \sqrt{\text{Variance}}$$

Coefficient of variation: It is same for all three types of data and given by,

$$CV = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100$$

a) For raw data:

Steps: If $x_1, x_2, \ldots, x_n$ are n observations then

1. Calculate mean of given observations.

2. Prepare the frequency distribution table of three columns, find deviations about mean in second column and their squares in third column as shown below.

| $x_i$ | $\lvert x_i - \bar{x} \rvert$ | $(x_i - \bar{x})^2$ |
|---|---|---|

3. Find the total of last column, $\sum (x_i - \bar{x})^2$.

4. Then SD is calculated as,

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}, \quad \text{where } \bar{x} = \frac{\sum x_i}{n}$$

OR

$$\sigma = \sqrt{\frac{\sum x_i^2}{n} - (\bar{x})^2}$$

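Ex.1. Find S.D. and C.V.: 2, 5, 7, 4, 3, 9

Ans:

$$\text{mean} = \bar{x} = \frac{\sum x_i}{n} = \frac{30}{6} = 5$$

Prepare the table:

| $x_i$ | $\lvert x_i - 5 \rvert$ | $(x_i - 5)^2$ |
|---|---|---|
| 2 | 3 | 9 |
| 5 | 0 | 0 |
| 7 | 2 | 4 |
| 4 | 1 | 1 |
| 3 | 2 | 4 |
| 9 | 4 | 16 |
| Total | | 34 |

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} = \sqrt{\frac{34}{6}} = \sqrt{5.67} = 2.38$$

$$CV = \frac{SD}{\bar{x}} \times 100 = \frac{2.38}{5} \times 100 = 47.6\%$$

Ex.2. If $n = 10$, $\sum x = 40$, $\sum x^2 = 520$ then find S.D.

Ans: Here,

$$\bar{x} = \frac{\sum x_i}{n} = \frac{40}{10} = 4$$

Use formula,

$$\sigma = \sqrt{\frac{\sum x_i^2}{n} - (\bar{x})^2} = \sqrt{\frac{520}{10} - 4^2} = \sqrt{36} = 6$$

Ex.3. For a certain distribution of 25 observations, mean is 50 and S.D. is 4. Find coefficient of variation (C.V.).

Ans: Use formula,

$$CV = \frac{SD}{\bar{x}} \times 100 = \frac{4}{50} \times 100 = 8\%$$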
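Both raw-data formulas for S.D. can be checked numerically. The following is a minimal sketch; the function names are illustrative, and the divisor is n (as in the formulas above), which is what Python's `statistics.pstdev` also uses.

```python
from math import sqrt
from statistics import pstdev

def population_sd(data):
    """sigma = sqrt(sum((x - mean)^2) / n)."""
    n = len(data)
    mean = sum(data) / n
    return sqrt(sum((x - mean) ** 2 for x in data) / n)

def sd_from_sums(n, sum_x, sum_x2):
    """Shortcut form: sigma = sqrt(sum(x^2)/n - mean^2)."""
    return sqrt(sum_x2 / n - (sum_x / n) ** 2)

data = [2, 5, 7, 4, 3, 9]                      # Ex.1
sd = population_sd(data)                       # sqrt(34 / 6)
cv = sd / (sum(data) / len(data)) * 100        # coefficient of variation, in %
print(round(sd, 2), round(cv, 1))

print(sd_from_sums(10, 40, 520))               # Ex.2: sqrt(52 - 16)
```

The shortcut form is convenient when only n, Σx and Σx² are given, as in Ex.2.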


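b) For discrete frequency distribution:

If $x_1, x_2, \ldots, x_n$ are n observations along with the corresponding frequencies $f_1, f_2, \ldots, f_n$ then S.D. is given by

$$\sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}, \quad \text{where } \bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$

OR

$$\sigma = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - (\bar{x})^2}$$

The frequency distribution table is given by,

| $x_i$ | $f_i$ | $x_i f_i$ | $\lvert x_i - \bar{x} \rvert$ | $(x_i - \bar{x})^2$ | $f_i (x_i - \bar{x})^2$ |
|---|---|---|---|---|---|

Ex.1. Find S.D. for following.

Age(Yr) 10 20 30 40 50

Freq 15 30 34 75 100

Ans: Prepare table as,

| $x_i$ | $f_i$ | $x_i f_i$ | $\lvert x_i - 39 \rvert$ | $(x_i - 39)^2$ | $f_i (x_i - 39)^2$ |
|---|---|---|---|---|---|
| 10 | 15 | 150 | 29 | 841 | 12615 |
| 20 | 30 | 600 | 19 | 361 | 10830 |
| 30 | 34 | 1020 | 9 | 81 | 2754 |
| 40 | 75 | 3000 | 1 | 1 | 75 |
| 50 | 100 | 5000 | 11 | 121 | 12100 |
| Total | 254 | 9770 | | | 38374 |

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{9770}{254} = 38.46, \text{ taken as } 39$$

$$\sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}} = \sqrt{\frac{38374}{254}} = \sqrt{151.08} = 12.29$$

c) For continuous frequency distribution:

For mid-values $X_1, X_2, \ldots, X_n$ of the given class intervals with the corresponding frequencies $f_1, f_2, \ldots, f_n$ the S.D. is given by

$$\sigma = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{\sum f_i}}, \quad \text{where } \bar{x} = \frac{\sum f_i X_i}{\sum f_i} \text{ and } X_i = \text{mid values}$$

OR

$$\sigma = \sqrt{\frac{\sum f_i X_i^2}{\sum f_i} - (\bar{x})^2}$$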

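The frequency distribution table is given by,

| CI | $f_i$ | $X_i$ | $f_i X_i$ | $\lvert X_i - \bar{x} \rvert$ | $(X_i - \bar{x})^2$ | $f_i (X_i - \bar{x})^2$ |
|---|---|---|---|---|---|---|

Ex.1. Find S.D.

Marks 0-20 20-40 40-60 60-80 80-100

Students 5 12 32 40 11

Ans: Prepare table as,

| CI | $f_i$ | $X_i$ | $f_i \cdot X_i$ | $\lvert X_i - 58 \rvert$ | $(X_i - 58)^2$ | $f_i (X_i - 58)^2$ |
|---|---|---|---|---|---|---|
| 0-20 | 5 | 10 | 50 | 48 | 2304 | 11520 |
| 20-40 | 12 | 30 | 360 | 28 | 784 | 9408 |
| 40-60 | 32 | 50 | 1600 | 8 | 64 | 2048 |
| 60-80 | 40 | 70 | 2800 | 12 | 144 | 5760 |
| 80-100 | 11 | 90 | 990 | 32 | 1024 | 11264 |
| Total | 100 | | 5800 | | | 40000 |

$$\bar{x} = \frac{\sum f_i X_i}{\sum f_i} = \frac{5800}{100} = 58$$

$$\sigma = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{\sum f_i}} = \sqrt{\frac{40000}{100}} = \sqrt{400} = 20$$

Ex.2. Calculate S.D. and C.V.

Class 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80

Freq 5 10 20 40 30 20 10 4

Ans: Prepare table as,

| CI | $f_i$ | $X_i$ | $f_i \cdot X_i$ | $\lvert X_i - 40 \rvert$ | $(X_i - 40)^2$ | $f_i (X_i - 40)^2$ |
|---|---|---|---|---|---|---|
| 0-10 | 5 | 5 | 25 | 35 | 1225 | 6125 |
| 10-20 | 10 | 15 | 150 | 25 | 625 | 6250 |
| 20-30 | 20 | 25 | 500 | 15 | 225 | 4500 |
| 30-40 | 40 | 35 | 1400 | 5 | 25 | 1000 |
| 40-50 | 30 | 45 | 1350 | 5 | 25 | 750 |
| 50-60 | 20 | 55 | 1100 | 15 | 225 | 4500 |
| 60-70 | 10 | 65 | 650 | 25 | 625 | 6250 |
| 70-80 | 4 | 75 | 300 | 35 | 1225 | 4900 |
| Total | 139 | | 5475 | | | 34275 |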

$$\bar{x} = \frac{\sum f_i X_i}{\sum f_i} = \frac{5475}{139} = 39.39 \cong 40$$

$$\sigma = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{\sum f_i}} = \sqrt{\frac{34275}{139}} = \sqrt{246.58} = 15.70$$

$$CV = \frac{\sigma}{\bar{x}} \times 100 = \frac{15.70}{40} \times 100 = 39.25\%$$
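For grouped data the same computation can be sketched in Python. The helper name `grouped_sd` is hypothetical; the sketch uses the exact mean rather than a rounded one, so it reproduces Ex.1 (marks 0-20 to 80-100) exactly, while examples worked with a rounded mean will differ slightly from its output.

```python
from math import sqrt

def grouped_sd(mid_values, freqs):
    """Return (mean, sigma) of a frequency table, using divisor sum(f)."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(mid_values, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(mid_values, freqs)) / n
    return mean, sqrt(variance)

mid_values = [10, 30, 50, 70, 90]      # mid-values of classes 0-20 ... 80-100
freqs = [5, 12, 32, 40, 11]

mean, sd = grouped_sd(mid_values, freqs)
cv = sd / mean * 100                   # coefficient of variation, in %
print(mean, sd, round(cv, 2))
```

For a discrete distribution, the observed values take the place of the mid-values.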

Ex.3. Calculate C.V.

Class 0-50 50-100 100-150 150-200 200-250 250-300

Freq 7 16 23 14 8 2

Ans: Prepare table as,

| CI | $f_i$ | $X_i$ | $f_i \cdot X_i$ | $\lvert X_i - 130 \rvert$ | $(X_i - 130)^2$ | $f_i (X_i - 130)^2$ |
|---|---|---|---|---|---|---|
| 0-50 | 7 | 25 | 175 | 105 | 11025 | 77175 |
| 50-100 | 16 | 75 | 1200 | 55 | 3025 | 48400 |
| 100-150 | 23 | 125 | 2875 | 5 | 25 | 575 |
| 150-200 | 14 | 175 | 2450 | 45 | 2025 | 28350 |
| 200-250 | 8 | 225 | 1800 | 95 | 9025 | 72200 |
| 250-300 | 2 | 275 | 550 | 145 | 21025 | 42050 |
| Total | 70 | | 9050 | | | 268750 |

$$\bar{x} = \frac{\sum f_i X_i}{\sum f_i} = \frac{9050}{70} = 129.29 \cong 130$$

$$\sigma = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{\sum f_i}} = \sqrt{\frac{268750}{70}} = \sqrt{3839.29} = 61.96$$

$$CV = \frac{\sigma}{\bar{x}} \times 100 = \frac{61.96}{130} \times 100 = 47.66\%$$

Merits of Standard Deviation:

(i) It is rigidly defined.

(ii) It is based on all the observations of the series and hence it is representative.

(iii) It strictly follows the algebraic principles, and it never ignores the + and – signs like the mean deviation.

(iv) It is capable of further algebraic treatment.

(v) It is least affected by fluctuations of sampling.

Demerits of Standard Deviation:

(i) It is relatively difficult to calculate and understand.

(ii) It is more affected by extreme items.

(iii) It cannot be exactly calculated for a distribution with open-ended classes.

(iv) It cannot be used for comparing the dispersion of two or more series given in different units.

4. Quartile Deviation (QD):

Quartiles:

If we divide the given data into four equal parts then there are three values say Q1, Q2 and Q3

at four divisions which are known as quartiles i.e. quartiles are the three values which divide the

data into four equal parts. Each group is a quarter of given data. Q1 is called first quartile or lower

quartile, Q2 is second quartile or median and Q3 is the third quartile or upper quartile.

Quartile deviation is also known as the semi-interquartile range or semi-quartile range. It gives the average amount by which the two quartiles differ from the median. Its calculation is similar to that of the median. The quartile deviation is half the difference between the third quartile and the first quartile, and for this reason it is often called the semi-interquartile range. It is given by

$$QD = \frac{Q_3 - Q_1}{2}$$

Coefficient of Quartile Deviation: It is same for all the following three types of data and is given by,

$$\text{Coeff. of } QD = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100$$

Calculation of Quartile Deviation:

a) For raw data:

Steps:

1. Arrange the observations in ascending order.

2. Find the positions of the quartiles.

(i) If number of observations is divisible by 4 then

1st quartile $= Q_1 = \left(\frac{n}{4}\right)^{th}$ observation

3rd quartile $= Q_3 = 3\left(\frac{n}{4}\right)^{th}$ observation

(ii) If number of observations is not divisible by 4 then

1st quartile $= Q_1 = \left(\frac{n + 1}{4}\right)^{th}$ observation

3rd quartile $= Q_3 = 3\left(\frac{n + 1}{4}\right)^{th}$ observation

Note: If a quartile lies between observations, the value of the quartile is the value of the lower

observation plus the specified fraction of the difference between the observations. For example, if

the position of a quartile is 20¼, it lies between the 20th and 21st observations, and its value is the

value of the 20th observation, plus ¼ the difference between the value of the 20th and 21st

observations.


3. Calculate quartile deviation using formula.

Ex.1. Calculate QD for the following data.

13, 7, 9, 15, 11, 5, 8, 4

Ans: Ascending order: 4, 5, 7, 8, 9, 11, 13, 15

Find the position of the 1st and 3rd quartiles.

Since there are 8 observations, $n = 8$ (divisible by 4),

1st quartile $= Q_1 = \left(\frac{n}{4}\right)^{th}$ obs $= \left(\frac{8}{4}\right)^{th}$ obs = 2nd obs = 5

3rd quartile $= Q_3 = 3\left(\frac{n}{4}\right)^{th}$ obs $= 3\left(\frac{8}{4}\right)^{th}$ obs = 6th obs = 11

Hence,

$$QD = \frac{Q_3 - Q_1}{2} = \frac{11 - 5}{2} = \frac{6}{2} = 3$$

Ex.2. Following are the marks obtained by 10 students: 56, 48, 65, 35, 42, 75, 82, 60, 55, 50. Find quartile deviation and its coefficient.

Ans: Ascending order: 35, 42, 48, 50, 55, 56, 60, 65, 75, 82

Here, $n = 10$ (not divisible by 4)

1st quartile $= Q_1 = \left(\frac{n + 1}{4}\right)^{th}$ obs $= \left(\frac{10 + 1}{4}\right)^{th}$ obs = 2.75th obs ... it lies in between 2nd & 3rd obs.

So, Q1 = 2nd obs + 0.75 (3rd obs – 2nd obs)

= 42 + 0.75 (48 – 42)

= 42 + 4.5

Q1 = 46.5

3rd quartile $= Q_3 = 3\left(\frac{n + 1}{4}\right)^{th}$ obs $= 3 \times \left(\frac{10 + 1}{4}\right)^{th}$ obs = 8.25th obs ... it lies in between 8th & 9th obs.

So, Q3 = 8th obs + 0.25 (9th obs – 8th obs)

= 65 + 0.25 (75 – 65)

= 65 + 2.5

Q3 = 67.5

Hence,

$$QD = \frac{Q_3 - Q_1}{2} = \frac{67.5 - 46.5}{2} = 10.5$$

&

$$\text{Coeff. of } QD = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100 = \frac{67.5 - 46.5}{67.5 + 46.5} \times 100 = \frac{21}{114} \times 100 = 18.42\%$$
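The position rule and the interpolation described in the note can be sketched as below. The function name `quartile` is illustrative, and the sketch assumes at least three observations; statistical libraries often use other quartile conventions, so their results may differ from this rule.

```python
def quartile(sorted_data, k):
    """k-th quartile (k = 1 or 3) by the n/4 or (n + 1)/4 position rule."""
    n = len(sorted_data)             # assumes n >= 3
    pos = k * n / 4 if n % 4 == 0 else k * (n + 1) / 4
    lower = int(pos)                 # whole part of the position (1-based)
    frac = pos - lower               # fractional part, e.g. 0.75
    value = sorted_data[lower - 1]
    if frac > 0:                     # interpolate towards the next observation
        value += frac * (sorted_data[lower] - sorted_data[lower - 1])
    return value

marks = sorted([56, 48, 65, 35, 42, 75, 82, 60, 55, 50])   # Ex.2
q1, q3 = quartile(marks, 1), quartile(marks, 3)
qd = (q3 - q1) / 2
coeff_qd = (q3 - q1) / (q3 + q1) * 100
print(q1, q3, qd, round(coeff_qd, 2))
```

With the eight observations of Ex.1 the same function returns the 2nd and 6th observations directly, since the positions are whole numbers.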


b) For discrete frequency distribution:

Steps: It is obtained using cumulative frequency as in median.

1. Find the cumulative frequency less than type and prepare frequency table.

2. Calculate $\frac{N}{4}$, where $N = \sum f_i$.

3. See the CF just greater than $\frac{N}{4}$ and its corresponding observation is 1st quartile Q1.

4. Now calculate $\frac{3N}{4}$.

5. See the CF just greater than $\frac{3N}{4}$ and its corresponding observation is 3rd quartile Q3.

6. Calculate quartile deviation using formula.

Ex.1. Calculate quartile deviation.

No. of goals scored 0 1 2 3 4

No. of matches 1 9 7 5 3

Ans: Prepare table as,

| $x_i$ | $f_i$ | c.f. |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 9 | 10 |
| 2 | 7 | 17 |
| 3 | 5 | 22 |
| 4 | 3 | 25 |
| Total | 25 | |

$$\frac{N}{4} = \frac{25}{4} = 6.25$$

C.f. just greater than 6.25 = 10

So, $Q_1$ = obs. corresponding to c.f. 10 = 1

$$\frac{3N}{4} = \frac{75}{4} = 18.75$$

C.f. just greater than 18.75 = 22

So, $Q_3$ = obs. corresponding to c.f. 22 = 3

Hence,

$$QD = \frac{Q_3 - Q_1}{2} = \frac{3 - 1}{2} = 1$$

c) For continuous frequency distribution:

Steps: It is obtained using cumulative frequency as in median.

1. Find the cumulative frequency less than type and prepare frequency table.

2. Calculate $\frac{N}{4}$, where $N = \sum f_i$.

3. See the CF just greater than $\frac{N}{4}$ and its corresponding class is Q1 class.

4. For the 1st quartile, use the formula

$$Q_1 = L + \frac{h}{f}\left(\frac{N}{4} - c\right)$$

Where:

$L$ = lower limit of Q1 class

$h$ = width of Q1 class

$f$ = frequency of Q1 class

$c$ = cumulative frequency of the class preceding the Q1 class

5. Now calculate $\frac{3N}{4}$.

6. See the CF just greater than $\frac{3N}{4}$ and its corresponding class is Q3 class.

7. For the 3rd quartile, use the formula

$$Q_3 = L + \frac{h}{f}\left(\frac{3N}{4} - c\right)$$

Where:

$L$ = lower limit of Q3 class

$h$ = width of Q3 class

$f$ = frequency of Q3 class

$c$ = cumulative frequency of the class preceding the Q3 class

8. Calculate quartile deviation using formula.

Ex.1. Calculate quartile deviation and its coefficient for following data.

Classes 10-15 15-20 20-25 25-30 30-35 35-40 40-45 45-50

Freq 4 4 6 8 10 9 7 5

Ans: Prepare the following table

| CI | $f_i$ | CF |
|---|---|---|
| 10-15 | 4 | 4 |
| 15-20 | 4 | 8 |
| 20-25 | 6 | 14 |
| 25-30 | 8 | 22 |
| 30-35 | 10 | 32 |
| 35-40 | 9 | 41 |
| 40-45 | 7 | 48 |
| 45-50 | 5 | 53 |
| Total | $\sum f_i = 53$ | |

For the 1st quartile,

$$\frac{N}{4} = \frac{\sum f_i}{4} = \frac{53}{4} = 13.25$$

Cf just greater than 13.25 = 14

Q1 class = 20-25

$L = 20, \quad f = 6, \quad c = 8, \quad h = 5$

$$Q_1 = L + \frac{h}{f}\left(\frac{N}{4} - c\right) = 20 + \frac{5}{6}(13.25 - 8) = 20 + \frac{26.25}{6} = 20 + 4.375 = 24.375$$

For the 3rd quartile,

$$\frac{3N}{4} = 3 \times \frac{\sum f_i}{4} = 3 \times \frac{53}{4} = 3 \times 13.25 = 39.75$$

Cf just greater than 39.75 = 41

Q3 class = 35-40

$L = 35, \quad f = 9, \quad c = 32, \quad h = 5$

$$Q_3 = L + \frac{h}{f}\left(\frac{3N}{4} - c\right) = 35 + \frac{5}{9}(39.75 - 32) = 35 + \frac{38.75}{9} = 35 + 4.305 = 39.305$$

Hence,

$$QD = \frac{Q_3 - Q_1}{2} = \frac{39.305 - 24.375}{2} = 7.465$$

&

$$\text{Coeff. of } QD = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100 = \frac{39.305 - 24.375}{39.305 + 24.375} \times 100 = \frac{14.93}{63.68} \times 100 = 23.45\%$$
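The class lookup and interpolation for quartiles of a continuous distribution can be sketched as follows. The function name `grouped_quartile` and the equal-width assumption are illustrative; the data are the classes 10-15 to 45-50 of the example above.

```python
def grouped_quartile(lower_limits, width, freqs, k):
    """Q_k = L + (h / f) * (k * N / 4 - c) for k = 1, 2 or 3."""
    target = k * sum(freqs) / 4      # N/4 for Q1, 3N/4 for Q3
    cum = 0                          # c: cumulative frequency before the class
    for L, f in zip(lower_limits, freqs):
        if cum + f > target:         # first class whose c.f. exceeds the target
            return L + width / f * (target - cum)
        cum += f
    raise ValueError("frequency table is empty")

lower_limits = [10, 15, 20, 25, 30, 35, 40, 45]
freqs = [4, 4, 6, 8, 10, 9, 7, 5]

q1 = grouped_quartile(lower_limits, 5, freqs, 1)
q3 = grouped_quartile(lower_limits, 5, freqs, 3)
qd = (q3 - q1) / 2
coeff_qd = (q3 - q1) / (q3 + q1) * 100
print(round(q1, 3), round(q3, 3), round(qd, 3), round(coeff_qd, 2))
```

Calling the same function with `k = 2` gives the median of the grouped data.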

Merits of Quartile Deviation:

(i) It is easy to calculate and simple to follow.

(ii) It is not affected by the extreme values and is, therefore, useful in skewed distributions.

(iii) It is the only method of dispersion applicable in case of ‘open-end classes’.

Demerits of Quartile Deviation:

(i) Since Quartile Deviation is based on Quartiles, sometimes it is not rigidly defined.

(ii) It is not based on all the observations in the series. Hence it is not representative.

(iii) It is not capable of further algebraic treatment.

(iv) It is not a stable measure of dispersion as it is affected very much by fluctuations of

sampling.


Exercise

1. What do you mean by statistics? Add note on statistical data.

2. Explain following terms in frequency distribution and data;

i. Individual data

ii. Discrete data

iii. Grouped data

iv. Classes

v. Frequency

vi. Class limits

vii. Class boundaries

viii. Class frequency

ix. Class interval

3. Given data contain weight in kg of group of 60 students. Prepare a frequency table taking

magnitude of class interval as 10 kg and the first class interval equal to 40 and less than 50.

50 52 86 94 49 90 76 96 64 70

69 80 79 73 81 110 84 67 77 65

74 60 115 61 83 72 79 103 51 78

71 66 77 84 42 69 80 68 104 79

54 59 100 53 76 50 78 63 95 42

40 82 41 75 63 113 98 43 55 76

4. Prepare a discrete frequency table for following data containing number of defectives in a lot.

2, 3, 1, 0, 1, 2, 1, 0, 1, 4, 5, 3, 2, 1, 0, 1, 3, 4, 1 , 5, 4, 3, 1, 0, 0, 1, 0, 2, 3, 1, 2, 4, 5, 0, 1, 0, 1,

0, 2, 4, 3, 5, 0, 1, 3, 2, 1, 0, 2, 2, 3, 0, 1, 3, 4, 0, 1, 3, 2, 5, 0, 1, 2.

5. For each of the given frequency distributions draw Histogram, Frequency polygon and Cumulative frequency polygon

i.

Weight in kg 80-90 90-100 100-110 110-120 120-130

No. of workers 07 11 15 08 04

ii.

Classes 0-10 10-20 20-30 30-40 40-50 50-60 60-70

Frequency 03 04 06 10 11 09 05

6. Following table gives the birth rate per thousand of different countries over certain period. Represent the given data by a suitable diagram plotting the countries against their birth rate.

Country Birth rate

India 41

Pakistan 35

Bangladesh 30

Srilanka 25

USA 20

UK 15

7. Draw a pie diagram for the following data of seventh five year plan of Government.

Agriculture 14%

Irrigation 13%

Health 27%

Education 15%

Social Development 16%

Employment 16%

Note: Angle at centre is given by $\frac{\text{Percentage of item}}{100} \times 360°$

8. Draw a pie diagram to represent the following data of population in a town;

Males 2000

Females 1800

Boys 4200

Girls 2000

Total 10000

9. Find the average from following data

52,69,93,72,56,85,73,66,94,85.

10. Calculate the average for given data

Marks 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90

No. of Students 00 02 03 07 13 13 09 02 01

11. Given data contain marks obtained by a batch at 10 students in certain class test. Calculate

Median marks.

28, 35, 46, 47, 60, 30, 32, 62, 64, 63.

12. Calculate median weight from given weight in grams

68, 66, 35, 42, 26, 85, 44, 80, 33, 72.


13. Calculate the median

Value <100 100-200 200-300 300-400 400 and above

Frequency 50 90 158 68 134

14. Calculate mean, mode and median for given data

X: 12, 13, 17, 18, 19, 19, 21, 22, 21, 27, 24, 30, 31, 31

15. Find the mean and mode from table

Classes 10-25 25-40 40-55 55-70 70-85 85-100

No. of students 06 50 44 26 03 01

16. Calculate mode and median

Monthly wage 20-30 30-40 40-50 50-60 60-70 70-80 80-90

No. of employees 28 32 45 60 56 40 20

17. From the following cumulative frequency table find the mean, median and mode.

Size below 5 10 15 20 25 30 35

Frequency 1 3 13 17 27 36 38

18. From the index numbers given below calculate the range and its coefficient.

188, 178, 173, 164, 172, 183, 184, 185, 211, 217, 232, 240.

19. Calculate mean deviation about mean

X 10 11 12 13

f 04 11 20 15

20. Calculate mean deviation from median

X: 3484, 4572, 4124, 3682, 5624, 4388, 3680, 4308.

21. Calculate mean deviation from median and its coefficient.

Classes 25-30 30-35 35-40 40-45 45-50 50-55 55-60 60-65 65-70

Freq 06 12 17 30 10 10 08 05 02

22. Calculate standard deviation from the following data;

Marks 0-10 10-20 20-30 30-40 40-50 50-60 60-70

No. of Students 05 07 14 12 09 06 02

23. Find S.D. of the data: 1, 2, 3, 4, 5, 6, 7, 8, 9

24. Calculate S.D. for: 15, 12, 9, 18, 21, 15

25. The coefficient of variation of certain distribution is 4 and mean is 60. Find S.D.

26. Calculate mean and S. D. from given data

Monthly pension in Rs. 40 50 60 70 80 100

No. of persons 03 06 04 09 03 05


27. Calculate S.D. and C.V.

Class 0-10 10-20 20-30 30-40 40-50 50-60 60-70

Freq 4 6 10 20 10 6 4

28. Calculate S.D. and C.V.

Class 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80

Freq 1 19 30 80 70 26 10 4

29. Calculate S.D. by step deviation method

Class 140-160 160-180 180-200 200-220 220-240 240-260

Freq 12 13 55 40 35 28

30. Calculate quartile deviation from given data

Age 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80

No. of persons 15 15 25 22 25 10 05 00

31. The following are the goals scored by a team. Calculate Q.D.

No. of goals scored 0 1 2 3 4

No. of matches 1 9 7 5 3

32. Lives of two models of refrigerators in a recent survey are;

Life (in years) Model A Model B

0-2 05 02

2-4 16 07

4-6 13 12

6-8 07 19

8-10 05 09

10-12 04 01

An analysis of the monthly wages paid to workers in two firms M and N of the same industry gives

the following results,

Parameters Firm M Firm N

No. of wage earners 58.6 648

Avg. monthly wages 52.5 47.5

Variance 100 121


2. Probability

Introduction:

‘What is probability?’ Nobody has a really good answer to this question. It is the language which we use to explain uncertainty. The theory of probability originated from games of chance (gambling). The correspondence between two French mathematicians, Blaise Pascal and Pierre de Fermat, gave rise to the study of probability. Throughout the 18th century, the application of probability moved from games of chance to scientific problems. In the study of statistics, we are concerned basically with the presentation and interpretation of chance outcomes that occur in a planned study or scientific investigation. Statisticians use the word experiment to describe any process that generates a set of data.

Probability is a part of our everyday lives. Modern research in probability theory is closely

related to the field of measure theory. The development of probability theory has been stimulated

by the variety of its applications. Statistics is one important branch of applied probability. One of

the difficulties in developing theory of probability is the definition of probability. The search for a

widely acceptable definition took centuries and was marked by controversy. The matter was finally

resolved in the 20th century by treating probability theory on an axiomatic basis.

Probability:

Probability is the branch of statistics that studies the possible outcomes of given events

together with their likelihoods and distributions. In common use, word probability is used to mean

the chance that a particular event will occur.

e.g. It is likely to rain, there is 60% chance that India will win the match, etc.

Probability is the measure of the likelihood that an event will occur. Probability is

quantified as a number between 0 and 1 (where 0 indicates impossibility and 1 indicates

certainty). The higher the probability of an event, more certain we are that the event will occur. The

topic of probability is seen in many facets of the modern world. The theory of probability is not just

taught in mathematical courses, but can be seen in practical fields, such as insurance, industrial

quality control, study of genetics, quantum mechanics, and the kinetic theory of gases.

In order to clear the concept of probability, we have some basic concepts:

1. Random Experiment:

It is a repeatable action in which all the possible results are known but the exact result is not known in advance.

e.g. “Tossing of a coin” is a random experiment, because we know the possible results in advance, i.e. either Head or Tail, but the exact one is not known.

2. Outcomes:

The results of the random experiments are called outcomes.


E.g. consider random experiment: A die is thrown.

We may get the number from 1, 2, 3, 4, 5, or 6 on the uppermost face of the die.

So this experiment has six outcomes.

3. Sample space:

The set of all possible outcomes of a random experiment is called sample space. It is denoted by ‘S’ or ‘Ω’. Each outcome in sample space is an element or member of that sample space.

E.g. "Tossing of two coins at a time" has sample space as,

$S = \{HH, HT, TH, TT\}$

4. Equiprobable (Equally likely) sample space:

When each outcome of the random experiment has an equal chance of happening, we say that the sample space is equiprobable. It is recognised by words such as "unbiased", "fair", "well shuffled" and "at random", as in the following statements.

(i) An unbiased coin is tossed.

(ii) A fair die is thrown.

(iii) A card is drawn from the well shuffled pack of 52 cards.

(iv) A book is selected at random, etc.

5. Event:

Any subset of a sample space is called an event. More than one event can occur in a random

experiment. Events are generally denoted by the capital letters like A, B, C ... etc.

E.g. Consider a random experiment of “throwing a fair die”, which has the following events:

i) The number on the uppermost face of a die is odd.

ii) The number on the uppermost face of a die is divisible by 2.

Sample space, $S = \{1, 2, 3, 4, 5, 6\}$

Now,

Let A be the event that the no. on the upper face of the die is odd.

So, $A = \{1, 3, 5\}$

Let B be the event that the no. on the upper face is divisible by 2.

So, $B = \{2, 4, 6\}$

Here, both the sets A and B are subsets of S.

Types of Events:

i) Simple event: Event containing single point is called simple event.

E.g. If two coins are tossed then $S = \{HH, HT, TH, TT\}$

Let A be the event that both coins show tail;

so $A = \{TT\}$, a single point.

ii) Sure/certain event: An event which contains all the sample points of the sample space is called as sure/certain event.

e.g. A card is drawn at random from the well shuffled pack of 52 cards.

Let B be the event that card drawn is Red or Black. This event has outcomes of all 52 cards; so it is a sure event.

iii) Impossible event: An event which does not contain any sample point of the sample space is an impossible event.

e.g. A die is thrown. $S = \{1, 2, 3, 4, 5, 6\}$

Let C be the event that number on upper face is greater than 6.

So $C = \{\} = \emptyset$

iv) Complementary event: Let A be an event of the sample space S. Then the complement of event A is the set containing the points in S but not in A. It is denoted by $A'$ or $A^c$.

e.g. A die is thrown. $S = \{1, 2, 3, 4, 5, 6\}$

Let D be the event of getting odd no. $D = \{1, 3, 5\}$

Then complement of D is $D^c = \{2, 4, 6\}$.

v) Mutually exclusive events: Two events say A and B are said to be mutually exclusive or disjoint events if they have no common point i.e. $A \cap B = \emptyset$

e.g. Throwing of a die. $S = \{1, 2, 3, 4, 5, 6\}$

A be the event of occurring even no. on upper face. $A = \{2, 4, 6\}$

B be the event of occurring odd no. on upper face. $B = \{1, 3, 5\}$

Here $A \cap B = \emptyset$ i.e. they have no same elements.

So, A and B are mutually exclusive events.

vi) Exhaustive events: Two or more events are said to be exhaustive events if their union is the sample space i.e. suppose A and B are events of S. A and B are exhaustive if $A \cup B = S$
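The event types above map directly onto set operations, which Python's built-in `set` type can illustrate. This is a small sketch for the die-throwing experiment; the variable names are chosen here for illustration only.

```python
S = {1, 2, 3, 4, 5, 6}                      # sample space of one die throw

A = {x for x in S if x % 2 == 0}            # even number on the upper face
B = {x for x in S if x % 2 == 1}            # odd number on the upper face
C = {x for x in S if x > 6}                 # number greater than 6

print(A <= S and B <= S)    # both events are subsets of S
print(C == set())           # impossible event: the empty set
print(S - B == A)           # complement of B within S
print(A & B == set())       # mutually exclusive: empty intersection
print(A | B == S)           # exhaustive: union is the sample space
```

Each line prints `True` for this experiment.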

Permutations and combinations:

The central theme of theory of permutations and combinations is to solve the counting

problems without doing any actual counting, which is inconvenient and at times difficult within

human limitations when the number of logical possibilities of an event is large.

Factorial notation: For any natural number n, the product (multiplication) of the first n natural numbers is denoted by n! and read as n factorial.

e.g. 5! = 5 × 4 × 3 × 2 × 1 = 120

Similarly, 100! = 1 × 2 × 3 × 4 ×… × 100

Permutation:

A permutation is an arrangement in a definite order of number of objects taken some or all

at a time.

e.g. Consider a three digit number 456. We want to make different numbers from these three digits

by taking two numbers at a time under the assumption that no number is repeated.

In this case, the two digits numbers formed are, 45, 46, 54, 56, 64, and 65.


Hence, the number of permutations of n different objects taken r at a time is the total number of ways in which n objects can be arranged at r places in a line, and it is given by

$^nP_r = \dfrac{n!}{(n-r)!}$

In particular, if $r = n$ then

$^nP_n = n!$

Ex.1. In how many ways can 5 different objects be arranged by taking 2 at a time?

Ans: Here, n = 5 and r = 2

$^5P_2 = \dfrac{5!}{(5-2)!} = \dfrac{5 \times 4 \times 3 \times 2 \times 1}{3 \times 2 \times 1} = 20$ ways

Hence, there are 20 ways to arrange 5 different objects, taken 2 at a time.

Ex.2. Calculate the number of ways in which three people from a group of seven people can be

seated in a row.

Ans: This is a case of permutation since the order is important.

Here, n = 7 and r = 3

The number of possible ways is:

$^7P_3 = \dfrac{7!}{(7-3)!} = \dfrac{7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{4 \times 3 \times 2 \times 1} = 210$ ways

Combinations:

A combination is a group of objects, irrespective of order, taken some or all at a time.

E.g. suppose there is a group of three students, say X, Y and Z. We have to make different groups containing two students in each group.

In this case, the groups formed are

XY, XZ, YZ. (Here, XY is the same group as YX.)

Hence, the total number of combinations of n different objects taken r at a time is given by

$^nC_r = \dfrac{n!}{r!\,(n-r)!}$

In particular, if $r = n$ then

$^nC_n = 1$

Ex.1. Find the total number of combinations of 10 objects taken 5 at a time.

Ans: Here, n = 10 and r = 5

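$^{10}C_5 = \dfrac{10!}{5!\,(10-5)!} = \dfrac{10 \times 9 \times 8 \times 7 \times 6 \times 5!}{5! \times 5!} = \dfrac{10 \times 9 \times 8 \times 7 \times 6}{5 \times 4 \times 3 \times 2 \times 1} = 252$ combinations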

Hence there are 252 combinations of 10 objects taken 5 at a time.


Ex.2. Calculate the number of combinations in which three people can be selected from a group of

seven.

Ans: Here the order is not important, so it is a case of combination.

Here, n = 7 and r = 3

The number of possible combinations is:

$^7C_3 = \dfrac{7!}{3!\,(7-3)!} = \dfrac{7 \times 6 \times 5 \times 4!}{3! \times 4!} = \dfrac{7 \times 6 \times 5}{3 \times 2 \times 1} = 35$ combinations
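The worked permutation and combination examples above can be checked with a few lines of Python. This is only a minimal sketch using the standard-library functions `math.perm` and `math.comb` (available from Python 3.8); the variable names are illustrative and not part of the text.

```python
import math

# Permutations: ordered arrangements of r objects chosen from n.
p_5_2 = math.perm(5, 2)    # 5P2
p_7_3 = math.perm(7, 3)    # 7P3

# Combinations: unordered selections of r objects chosen from n.
c_10_5 = math.comb(10, 5)  # 10C5
c_7_3 = math.comb(7, 3)    # 7C3

print(p_5_2, p_7_3, c_10_5, c_7_3)  # 20 210 252 35

# Each combination of r objects can be ordered in r! ways, so nPr = r! * nCr.
print(p_7_3 == math.factorial(3) * c_7_3)  # True
```

The last line shows why the number of permutations is never smaller than the number of combinations for the same n and r.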

Example to show the relationship between permutation and combination:

Objects (n)   Taken at a time (r)   Combinations (nCr)   Permutations (nPr)
P, Q          2                     PQ                   PQ, QP
P, Q, R       2                     PQ, PR, QR           PQ, PR, QP, QR, RP, RQ
P, Q, R       3                     PQR                  PQR, PRQ, QPR, QRP, RPQ, RQP

Thus, since $^nP_r = r! \times {}^nC_r$, the number of permutations is never less than the number of combinations, and it is strictly greater whenever $r \geq 2$.

Classical Definition of Probability:

Statistically, the term probability can be defined in the following way:

If S is the sample space with n outcomes of a random experiment and A is an event with m outcomes, then the probability of event A is denoted by P(A) and is defined as

$P(A) = \dfrac{\text{No. of outcomes in } A}{\text{No. of outcomes in } S} = \dfrac{m}{n}$

In short, if

$n(S)$ = no. of elements in sample space S

$n(A)$ = no. of elements in event A

then

$P(A) = \dfrac{n(A)}{n(S)}$

Probability Axioms and simple Properties:

Probabilities, however assigned, must satisfy three specific axioms:

Axiom 1: For any event A, $P(A) \geq 0$.

Axiom 2: $P(S) = 1$.

Axiom 3: For any sequence of disjoint events $A_1, A_2, \ldots, A_n$,

$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i)$


These axioms are all we need to develop a theory of probability, but there is a collection of

commonly used properties which follow directly from these axioms, and which we make extensive

use of when carrying out probability calculations.

Property A: Probability of the complementary event

$P(A^c) = 1 - P(A)$

Property B: $P(\emptyset) = 0$

Property C: If $A \subseteq B$, then $P(A) \leq P(B)$.

Property D: Addition Property

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

Ex.1. In a box, there are 5 Aspirin, 6 Analgin and 10 Paracetamol. If one tablet is chosen at random

find the probability that:

i) it is Analgin.

ii) it is Aspirin or Paracetamol.

Ans: There are 5 + 6 + 10 = 21 tablets in total in the box.

$n(S) = 21$

i) Let A be the event that the tablet chosen is Analgin.

$n(A) = 6$

$P(A) = \dfrac{n(A)}{n(S)} = \dfrac{6}{21} = 0.2857$

ii) Let B be the event that the tablet chosen is Aspirin or Paracetamol.

There are 5 Aspirin + 10 Paracetamol tablets.

$n(B) = 15$

$P(B) = \dfrac{n(B)}{n(S)} = \dfrac{15}{21} = 0.7143$

Ex.2. A card is selected at random from a well-shuffled pack of 52 cards. Find the probability of getting

i) a face card ii) a red card iii) not a club card.

Ans: There are 52 cards in total in a pack.

$n(S) = 52$

i) Let A be the event of getting a face card.

Number of face cards = 4 Jacks + 4 Queens + 4 Kings = 12

$n(A) = 12$

$P(A) = \dfrac{n(A)}{n(S)} = \dfrac{12}{52} = \dfrac{3}{13} = 0.2308$

ii) Let B be the event of getting a red card.

Number of red cards = 13 Hearts + 13 Diamonds

$n(B) = 26$

$P(B) = \dfrac{n(B)}{n(S)} = \dfrac{26}{52} = 0.50$

iii) Let C be the event of getting a club card.

$C^c$ = complement of C, i.e. not getting a club card

Number of club cards = 13

$n(C) = 13$

$P(C) = \dfrac{n(C)}{n(S)} = \dfrac{13}{52} = 0.25$

Now, $P(C^c) = 1 - P(C) = 1 - 0.25 = 0.75$

Ex.3. A pair of fair dice is thrown. Find the probability of getting,

(i) A number greater than 4 on each die.

(ii) Odd number on first die and 5 on second die.

(iii) Sum of points is 10

(iv) Same points on both dice.

Ans: A pair of fair dice is thrown.

$S = \{(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (2,1), (2,2), (2,3), (2,4), (2,5), (2,6),$

$(3,1), (3,2), (3,3), (3,4), (3,5), (3,6), (4,1), (4,2), (4,3), (4,4), (4,5), (4,6),$

$(5,1), (5,2), (5,3), (5,4), (5,5), (5,6), (6,1), (6,2), (6,3), (6,4), (6,5), (6,6)\}$

$n(S) = 36$.

i) Let event A: getting a number greater than 4 on each die.

$A = \{(5,5), (5,6), (6,5), (6,6)\}$, so $n(A) = 4$

$P(A) = \dfrac{4}{36} = \dfrac{1}{9}$

ii) Let event B: getting an odd number on the 1st die and 5 on the 2nd die.

$B = \{(1,5), (3,5), (5,5)\}$, so $n(B) = 3$

$P(B) = \dfrac{3}{36} = \dfrac{1}{12}$

iii) Let event C: getting the sum of points as 10.

$C = \{(4,6), (5,5), (6,4)\}$, so $n(C) = 3$

$P(C) = \dfrac{3}{36} = \dfrac{1}{12}$

iv) Let event D: getting the same points on both dice.

$D = \{(1,1), (2,2), (3,3), (4,4), (5,5), (6,6)\}$, so $n(D) = 6$

$P(D) = \dfrac{6}{36} = \dfrac{1}{6}$
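For small sample spaces, the classical definition P(A) = n(A)/n(S) can be checked by listing every outcome and counting. The sketch below does this for the pair of dice; the helper name `prob` and the lambda-based event descriptions are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes (first die, second die) of throwing two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """Classical probability n(A)/n(S); `event` is a test on one outcome."""
    favourable = [o for o in sample_space if event(o)]
    return Fraction(len(favourable), len(sample_space))

p_a = prob(lambda o: o[0] > 4 and o[1] > 4)        # greater than 4 on each die
p_b = prob(lambda o: o[0] % 2 == 1 and o[1] == 5)  # odd on first die, 5 on second
p_c = prob(lambda o: o[0] + o[1] == 10)            # sum of points is 10
p_d = prob(lambda o: o[0] == o[1])                 # same points on both dice

print(p_a, p_b, p_c, p_d)  # 1/9 1/12 1/12 1/6
```

Using `Fraction` keeps the answers as exact fractions, which makes them easy to compare with the hand calculation.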


Conditional probability:

Conditional probability is the likelihood of an event occurring given that another event has already occurred. That is, the probability of an event A may change once we know that some other event B has occurred; the revised value is known as the conditional probability of the event A given that the event B has occurred. We write this as $P(A \mid B)$.

If A and B are any 2 events with $P(B) > 0$, then

$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$

Similarly, $P(B \mid A) = \dfrac{P(A \cap B)}{P(A)}$; $P(A) > 0$

Ex.1. You toss a fair coin three times. Given that you have observed at least one heads, what is the

probability that you observe at least two heads?

Ans: A coin is tossed three times.

$S = \{TTT, TTH, THT, THH, HHH, HHT, HTH, HTT\}$, $n(S) = 8$

Let A be the event that at least one head is observed.

$A = \{TTH, THT, THH, HHH, HHT, HTH, HTT\}$

$P(A) = \dfrac{7}{8}$

Let B be the event that at least two heads are observed.

$B = \{THH, HHH, HHT, HTH\}$

$P(B) = \dfrac{4}{8}$

Since every outcome in B is also in A, $A \cap B = \{THH, HHH, HHT, HTH\} = B$.

The probability of the event B given that the event A has occurred is

$P(B \mid A) = \dfrac{P(A \cap B)}{P(A)} = \dfrac{P(B)}{P(A)} = \dfrac{4}{8} \cdot \dfrac{8}{7} = \dfrac{4}{7}$
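The conditional probability in this example can also be obtained by enumeration. The following is a small sketch, assuming a fair coin so that all eight outcomes are equally likely; the set names mirror the events A and B of the example.

```python
from fractions import Fraction
from itertools import product

# The 8 equally likely outcomes of tossing a fair coin three times.
outcomes = ["".join(t) for t in product("HT", repeat=3)]

A = {o for o in outcomes if o.count("H") >= 1}  # at least one head
B = {o for o in outcomes if o.count("H") >= 2}  # at least two heads

p_a = Fraction(len(A), len(outcomes))
p_a_and_b = Fraction(len(A & B), len(outcomes))

# Definition of conditional probability: P(B | A) = P(A and B) / P(A).
p_b_given_a = p_a_and_b / p_a
print(p_b_given_a)  # 4/7
```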

Ex.2. Out of 50 people surveyed in a study, 35 people smoke, of whom 20 are males. What is the probability that a surveyed person is male, given that the person smokes?

Ans: Here, $n(S) = 50$

Let A be the event that the person is a smoker.

$P(A) = \dfrac{n(A)}{n(S)} = \dfrac{35}{50}$

Let B be the event that the person is a male smoker.

$P(B) = \dfrac{n(B)}{n(S)} = \dfrac{20}{50}$

Every male smoker is a smoker, so $A \cap B = B$ and $P(A \cap B) = \dfrac{20}{50}$, i.e. the probability that the person is male and a smoker.

Then the probability that a person is male given that the person is a smoker is

$P(B \mid A) = \dfrac{P(A \cap B)}{P(A)} = \dfrac{20}{50} \cdot \dfrac{50}{35} = \dfrac{4}{7}$

Multiplication Property:

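If A and B are any two events of a given random experiment with $P(A) > 0$, then

$P(A \cap B) = P(A) \cdot P(B \mid A)$

If A and B are independent events, then $P(B \mid A) = P(B)$ and the property reduces to $P(A \cap B) = P(A) \cdot P(B)$.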

Ex.4. Find the probability that a single toss of a die will result in a number less than 4 if it is given that the toss resulted in an odd number.

Ans: For tossing of a die, $S = \{1, 2, 3, 4, 5, 6\}$, $n(S) = 6$

It is given that the toss has already resulted in an odd number.

Let event A: the toss resulted in an odd number.

$A = \{1, 3, 5\}$, $n(A) = 3$

$P(A) = \dfrac{3}{6} = \dfrac{1}{2}$

Let event B: the toss results in a number less than 4.

$B = \{1, 2, 3\}$, $n(B) = 3$

$P(B) = \dfrac{3}{6} = \dfrac{1}{2}$

$A \cap B = \{1, 3\}$, $n(A \cap B) = 2$

$P(A \cap B) = \dfrac{2}{6} = \dfrac{1}{3}$

Hence, the required probability is

P(number is less than 4 given that it is odd)

$= P(B \mid A) = \dfrac{P(A \cap B)}{P(A)} = \dfrac{1/3}{1/2} = \dfrac{2}{3}$

Ex.5. A bag contains 3 pink candies and 7 green candies. Two candies are taken out from the bag

with replacement. Find the probability that both candies are pink.

Ans: Here, $n(S) = 3 + 7 = 10$

Let A be the event that the first candy is pink and

B be the event that the second candy is pink.

$P(A) = P(B) = \dfrac{3}{10}$

Since the candies are taken out with replacement, the events A and B are independent.

$P(B \mid A) = P(B) = \dfrac{3}{10}$

Hence, the probability that both candies are pink is

$P(A \cap B) = P(A) \cdot P(B \mid A) = \dfrac{3}{10} \cdot \dfrac{3}{10} = \dfrac{9}{100} = 0.09$

Random Variables:

Specifying a model for a random experiment via a complete description of sample space S

and probability P may not always be convenient or necessary. In practice we are only interested in

various observations (i.e., numerical measurements) of the experiment. We include these into our

modelling process via the introduction of random variables.

A random variable is a function that associates a real number with each element in the sample space. Despite its name, a random variable is neither random nor a variable: it is a function defined on a sample space. In general the values of such a function can be anything at all, but for us they will always be numbers.

E.g. consider the sample space for tossing a fair coin twice:

$S = \{HH, HT, TH, TT\}$

These outcomes are equally likely. There are several random quantities we could associate

with this experiment. For example, we could count the number of heads, or the number of tails.

Formally, a random variable is a real-valued function which acts on the elements of the sample space (outcomes), i.e. to each outcome the random variable assigns a real number. Random variables are usually denoted by upper-case letters.

In our example, if we let X be the number of heads, we have

$X(HH) = 2$;

$X(HT) = 1$;

$X(TH) = 1$;

$X(TT) = 0$.

Hence, 2, 1, 1, 0 are the values of the random variable X for the outcomes in the sample space S.

In short,

Outcomes              HH   HT   TH   TT
Random variable (X)   2    1    1    0

There are two types of random variable:

a) Discrete random variable

b) Continuous random variable

a) A discrete random variable takes only isolated or integral values, i.e. a countable number of real values.

E.g. marks obtained by students, number of accidents in a year, etc.

b) A continuous random variable can take all possible values between certain limits or in an interval.

E.g. measurements of rainfall, lifetime of a component, height and weight, etc.

Probability Distribution:

In probability and statistics, a probability distribution assigns a probability to each measurable subset of the possible outcomes of a random experiment. Probability distributions are used at both a theoretical and a practical level. A listing of all the values the random variable can assume, together with their corresponding probabilities, makes up a probability distribution.

A discrete probability distribution is a table (or a formula) listing all possible values that a discrete variable can take on, together with the associated probabilities. It is a function that satisfies the following properties:

1. The probability that the discrete variable X takes the value x is given by

$P(X = x) = P(x) = p_x$

2. It is non-negative for all real x.

3. The sum of P(x) over all possible values of x is 1,

$\sum_{i=1}^{n} p_i = 1$

4. Discrete probability functions are referred to as probability mass functions.

A continuous probability distribution is a function f(x) that satisfies the following properties:

1. The probability that the continuous variable X lies between two points a and b is

$P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx$

2. It is non-negative for all real x.

3. The integral of the probability function over the whole real line is one,

$\int_{-\infty}^{\infty} f(x)\,dx = 1$

4. Continuous probability functions are referred to as probability density functions.

Some practical uses of probability distribution are:

(i) To calculate confidence interval for parameters and to calculate critical region for

hypothesis tests.

(ii) For univariate data, it is often useful to determine a reasonable distribution model for the

data.

(iii) Simulation studies with random numbers generated from using a specific probability

distribution are often needed.


In general, if $e_1, e_2, \ldots, e_n$ are the n outcomes of a sample space and $x_1, x_2, \ldots, x_n$ are the corresponding values of the random variable, with probabilities $p_1, p_2, \ldots, p_n$, then the probability distribution table is given by

Outcomes (S)                e_1   e_2   ...   e_n   Total
Random variable (X)         x_1   x_2   ...   x_n
Probability P(X = x_i)      p_1   p_2   ...   p_n   1

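Ex.1. Suppose a coin is tossed twice. Find the probability distribution of the number of heads.

Ans: A coin is tossed twice. Assuming the coin is fair, the four outcomes are equally likely, each with probability 1/4. Hence the distribution is as follows,

Outcomes (S)                HH    HT    TH    TT    Total
Random variable (X)         2     1     1     0
Probability                 1/4   1/4   1/4   1/4   1

Collecting equal values of X gives $P(X = 0) = \dfrac{1}{4}$, $P(X = 1) = \dfrac{1}{2}$ and $P(X = 2) = \dfrac{1}{4}$.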

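Ex.2. Find the probability function corresponding to the random variable X, the number of heads, assuming that a fair coin is tossed thrice.

Ans: The eight outcomes are equally likely, each with probability 1/8.

Outcomes (S)                HHH   HHT   HTH   HTT   THH   THT   TTH   TTT
Random variable (X)         3     2     2     1     2     1     1     0
Probability                 1/8   1/8   1/8   1/8   1/8   1/8   1/8   1/8

Collecting equal values of X gives the probability function

X = x_i                     0     1     2     3     Total
P(X = x_i)                  1/8   3/8   3/8   1/8   1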

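Ex.3. Find the constant c such that

$f(x) = c x^2$ for $0 < x < 3$

$f(x) = 0$ otherwise

is a density function.

Ans: Since f(x) must satisfy the 3rd property,

$\int_{-\infty}^{\infty} f(x)\,dx = 1$

Now,

$\int_{-\infty}^{\infty} f(x)\,dx = \int_{0}^{3} c x^2\,dx = \left[\dfrac{c x^3}{3}\right]_{0}^{3} = 9c$

Hence,

$9c = 1$, so $c = \dfrac{1}{9}$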

Expected value and Variance:

For a given random variable X with the probability distribution,

Random variable (X)   x_1   x_2   ...   x_n
Probability P(x)      p_1   p_2   ...   p_n

the expected value (or mean) of the random variable is the number E(X) given by

$E(X) = \sum_{i=1}^{n} x_i p_i$ (for discrete variables)

$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$ (for continuous variables)

Ex.1. A die is thrown. The random variable X is “the number of dots that appear”. Find the expected

value of this random variable.

Ans: For the throwing of a die, the possible numbers of dots are $S = \{1, 2, 3, 4, 5, 6\}$

Hence, the probability distribution table is given by

No. of dots (x_i)     1     2     3     4     5     6     Total
P(X = x_i) = p_i      1/6   1/6   1/6   1/6   1/6   1/6   1
x_i p_i               1/6   1/3   1/2   2/3   5/6   1     3.5

$E(X) = \sum_{i=1}^{6} x_i p_i = \dfrac{21}{6} = \dfrac{7}{2} = 3.5$

Ex.2. A lot containing 7 components is sampled by a quality inspector; the lot contains 4 good

components and 3 defective components. A sample of 3 is taken by the inspector. Find the expected

value of the number of good components in this sample.

Ans: This is a case of combination where n = 7 and r = 3.

To find n(S),

$^7C_3 = \dfrac{7!}{3!\,(7-3)!} = \dfrac{7 \times 6 \times 5 \times 4!}{3! \times 4!} = \dfrac{7 \times 6 \times 5}{3 \times 2 \times 1} = 35$ combinations

So, $n(S) = 35$

Using the formula $^4C_r \times {}^3C_{3-r}$, find the number of samples containing 0, 1, 2 or 3 good components by putting r = 0, 1, 2, 3 respectively.

The probability distribution table for the number of good components in a sample is,

No. of good comp. (x_i)   0      1       2       3       Total
P(X = x_i) = p_i          1/35   12/35   18/35   4/35    1
x_i p_i                   0      12/35   36/35   12/35   60/35

$E(X) = \sum_{i=0}^{3} x_i p_i = \dfrac{60}{35} = \dfrac{12}{7} \approx 1.71$

Thus, if a sample of size 3 is selected at random again and again from a lot of 4 good components and 3 defective components, it will contain, on average, about 1.7 good components.
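The table in this example can be built mechanically from the counting formula, which is a useful check on the arithmetic. The sketch below is one way to do it with the standard library; the names `good`, `defective`, `sample_size` and `pmf` are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

good, defective, sample_size = 4, 3, 3

# n(S): number of ways of choosing 3 components from all 7.
total_samples = comb(good + defective, sample_size)

# P(X = r): choose r good components and (3 - r) defective ones.
pmf = {
    r: Fraction(comb(good, r) * comb(defective, sample_size - r), total_samples)
    for r in range(sample_size + 1)
}

# Expected value E(X) = sum of x_i * p_i over the whole table.
expected = sum(r * p for r, p in pmf.items())

print(total_samples)                     # 35
print(expected, round(float(expected), 2))  # 12/7 1.71
```

Note that `Fraction` reduces each probability automatically, so 18/35 stays as it is while a value such as 60/35 is shown as 12/7.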

The variance of a random variable X with the probability distribution P(X = x_i) is the number Var(X) or $\sigma^2$ given by

$\sigma^2 = Var(X) = E[(X - E(X))^2] = \sum_{i} [x_i - E(X)]^2 \cdot p_i$ (for discrete variables)

$\sigma^2 = Var(X) = E[(X - E(X))^2] = \int_{-\infty}^{\infty} [x - E(X)]^2 f(x)\,dx$ (for continuous variables)

In both cases this simplifies to $\sigma^2 = E(X^2) - [E(X)]^2$.

Ex.3. Let the random variable X represent the number of automobiles that are used for official business on any given workday. The probability distribution for a company is

X = x_i              0     1     2     3
P(X = x_i) = p_i     0.2   0.1   0.4   0.3

Calculate the variance of the random variable X.

Ans: Prepare the following probability distribution table.

X = x_i              0     1     2     3     Total
P(X = x_i) = p_i     0.2   0.1   0.4   0.3   1
x_i p_i              0     0.1   0.8   0.9   1.8
x_i^2                0     1     4     9
x_i^2 p_i            0     0.1   1.6   2.7   4.4

$E(X) = \sum_{i=0}^{3} x_i p_i = 1.8$

$E(X^2) = \sum_{i=0}^{3} x_i^2 p_i = 4.4$

Now, $Var(X) = \sigma^2 = E(X^2) - [E(X)]^2$

$= 4.4 - (1.8)^2$

$= 4.4 - 3.24$

$= 1.16$

Ex.4. Let the random variable X represent the number of defective parts for a machine when 3 parts are sampled from a production line and tested. The following is the probability distribution of X. Calculate $\sigma^2$.

x_i     0      1      2      3
p_i     0.51   0.38   0.10   0.01

Ans: Prepare the following probability distribution table.

x_i          0      1      2      3      Total
p_i          0.51   0.38   0.10   0.01   1
x_i p_i      0      0.38   0.20   0.03   0.61
x_i^2        0      1      4      9
x_i^2 p_i    0      0.38   0.40   0.09   0.87

$E(X) = \sum_{i=0}^{3} x_i p_i = 0.61$

$E(X^2) = \sum_{i=0}^{3} x_i^2 p_i = 0.87$

Now, $Var(X) = \sigma^2 = E(X^2) - [E(X)]^2$

$= 0.87 - (0.61)^2$

$= 0.87 - 0.3721$

$= 0.4979$

Ex.5. Find the expected value for the density function of a random variable X given by

$f(x) = \dfrac{1}{2}x$ for $0 < x < 2$

$f(x) = 0$ otherwise

Ans: $E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{0}^{2} x \left(\dfrac{1}{2}x\right) dx = \int_{0}^{2} \dfrac{1}{2}x^2\,dx = \left[\dfrac{x^3}{6}\right]_{0}^{2} = \dfrac{4}{3}$

Following are some special Probability Distributions:

1. Binomial Distribution:

One of the most commonly encountered discrete distributions is the binomial distribution. It is built on the Bernoulli trial or Bernoulli process; the Bernoulli distribution is its special case with a single trial (n = 1). An experiment involving repeated trials, where only two complementary outcomes are possible and can be labelled as "success" or "failure", is called a Bernoulli process. The most obvious application deals with the testing of items as they come off an assembly line, where each trial may indicate a defective or a non-defective item. We may choose to define either outcome as a success.

If p is the probability of success then $q = 1 - p$ is the probability of failure.

The Bernoulli process must possess the following properties:

1. The experiment consists of repeated trials.

2. Each trial results in an outcome that may be classified as a success or a failure.

3. The probability of success, denoted by p, remains constant from trial to trial.

4. The repeated trials are independent.

If n = number of independent Bernoulli trials of an experiment and

r = number of successes observed in the Bernoulli experiment

then the number of combinations of n trials with r successes is given by:

$^nC_r = C(n, r) = \dbinom{n}{r} = \dfrac{n!}{r!\,(n-r)!}$

The number X of successes in n Bernoulli trials is called a binomial random variable. The probability distribution of this discrete random variable is called the binomial distribution, and its values will be denoted by B(x; n, p), with X ~ B(n, p), since they depend on the number of trials and the probability of a success on a given trial.

Thus, in n trials, the probability of obtaining r successes and (n − r) failures is:

Probability (r successes out of n trials) = $P(X = r)$

$P(r) = \dbinom{n}{r} p^r q^{n-r}$

i.e. $P(r) = \dfrac{n!}{r!\,(n-r)!}\, p^r q^{n-r}$

where n = no. of independent trials

r = no. of successes in n trials

p = probability of success in one trial

q = 1 − p = probability of failure

Note:

1. $P(r \leq n) = \sum_{r=0}^{n} \dbinom{n}{r} p^r q^{n-r} = (p + q)^n = 1$ for r = 0, 1, 2, ..., n

Hence, $P(r \geq 1) = 1 - P(r = 0) = 1 - q^n$

2. If n independent trials constitute an experiment and if this experiment is repeated N times, the expected frequencies are given by:

$f(r) = N \dbinom{n}{r} p^r q^{n-r}$ for r = 0, 1, 2, ..., n

3. The mean of the binomial distribution is $\bar{x} = np$

and the variance is $\sigma^2 = npq$, i.e. $SD = \sigma = \sqrt{npq}$

4. Binomial distribution: expresses the probability of r successes in an experiment with n trials ($0 \leq r \leq n$).

5. Geometric distribution: expresses the probability of having to wait exactly r trials for the first successful event ($r \geq 1$).

6. Negative Binomial distribution: expresses the probability of having to wait exactly r trials until k successes have occurred ($r \geq k$). This form is sometimes referred to as the Pascal distribution. Sometimes this distribution is expressed as the number of failures n occurring while waiting for k successes ($n \geq 0$).

Ex.1. If X is binomially distributed with 6 trials and a probability of success equal to $\dfrac{1}{4}$ at each attempt, what is the probability of:

a) Exactly 4 successes, b) At least one success?

Ans: Here, $n = 6$, $p = \dfrac{1}{4}$, $q = 1 - \dfrac{1}{4} = \dfrac{3}{4}$

a) For exactly 4 successes, r = 4

$P(r) = \dfrac{n!}{r!\,(n-r)!}\, p^r q^{n-r}$

$P(X = 4) = \dfrac{6!}{4!\,(6-4)!} \left(\dfrac{1}{4}\right)^4 \left(\dfrac{3}{4}\right)^{6-4}$

$= 15 \times \dfrac{1}{256} \times \dfrac{9}{16}$

$= \dfrac{135}{4096} = 0.033$

b) For at least one success, $r \geq 1$, i.e. r is not zero.

$P(r \geq 1) = 1 - P(r = 0)$, where $P(r = 0)$ is the probability of no success in any trial.

$= 1 - \left(\dfrac{3}{4}\right)^6$

$= 1 - \dfrac{729}{4096}$

$= \dfrac{3367}{4096} = 0.822$
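The binomial formula translates directly into code. The sketch below is a minimal illustration with a hypothetical helper `binomial_pmf`; it uses exact fractions so that the results can be compared with the hand calculation of this example.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) = nCr * p^r * q^(n - r) with q = 1 - p."""
    q = 1 - p
    return comb(n, r) * p**r * q**(n - r)

p = Fraction(1, 4)

exactly_four = binomial_pmf(4, 6, p)      # part (a)
at_least_one = 1 - binomial_pmf(0, 6, p)  # part (b): complement of "no success"

print(exactly_four, round(float(exactly_four), 3))  # 135/4096 0.033
print(at_least_one, round(float(at_least_one), 3))  # 3367/4096 0.822
```

Passing an ordinary float for `p` works as well; the function then returns a float instead of a fraction.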

Ex.2. When an unbiased coin is tossed 8 times what is the probability of getting:

a) less than 4 heads b) more than 5 heads?

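Ans: Here, $n = 8$

Let p be the probability of getting a head.

$p = \dfrac{1}{2}$, $q = 1 - \dfrac{1}{2} = \dfrac{1}{2}$

a) For less than 4 heads, $r < 4$, i.e. $r \leq 3$

$P(r \leq 3) = \sum_{r=0}^{3} \dbinom{n}{r} p^r q^{n-r}$

$P(r \leq 3) = P(r = 0) + P(r = 1) + P(r = 2) + P(r = 3)$

$= \left(\dfrac{1}{2}\right)^8 + {}^8C_1 \left(\dfrac{1}{2}\right)^1 \left(\dfrac{1}{2}\right)^7 + {}^8C_2 \left(\dfrac{1}{2}\right)^2 \left(\dfrac{1}{2}\right)^6 + {}^8C_3 \left(\dfrac{1}{2}\right)^3 \left(\dfrac{1}{2}\right)^5$

$= \left(\dfrac{1}{2}\right)^8 + 8\left(\dfrac{1}{2}\right)^8 + 28\left(\dfrac{1}{2}\right)^8 + 56\left(\dfrac{1}{2}\right)^8$

$= 93 \left(\dfrac{1}{2}\right)^8$

$= \dfrac{93}{256} = 0.3633$

b) For more than 5 heads, $r > 5$

$P(r > 5) = \sum_{r=6}^{8} \dbinom{n}{r} p^r q^{n-r}$

$P(r > 5) = P(r = 6) + P(r = 7) + P(r = 8)$

$= {}^8C_6 \left(\dfrac{1}{2}\right)^6 \left(\dfrac{1}{2}\right)^2 + {}^8C_7 \left(\dfrac{1}{2}\right)^7 \left(\dfrac{1}{2}\right)^1 + {}^8C_8 \left(\dfrac{1}{2}\right)^8 \left(\dfrac{1}{2}\right)^0$

$= 28\left(\dfrac{1}{2}\right)^8 + 8\left(\dfrac{1}{2}\right)^8 + \left(\dfrac{1}{2}\right)^8$

$= 37 \left(\dfrac{1}{2}\right)^8$

$= \dfrac{37}{256} = 0.1445$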

Ex.3. A biased die is thrown thirty times and the number of sixes seen is eight. If the die is thrown a

further twelve times, find:

a) the probability that a six will occur exactly twice;

b) the expected number of sixes;

c) the variance of number of sixes.

Ans: A biased die is thrown thirty times and the number of sixes seen is eight.

So, $p = \dfrac{8}{30} = \dfrac{4}{15}$, $q = \dfrac{11}{15}$

Now, let X be defined as "the number of sixes seen in 12 throws".

Here, n = 12

a) For the probability that a six will occur exactly twice, r = 2

$P(r) = \dfrac{n!}{r!\,(n-r)!}\, p^r q^{n-r}$

$P(2) = \dfrac{12!}{2!\,(12-2)!} \left(\dfrac{4}{15}\right)^2 \left(\dfrac{11}{15}\right)^{12-2}$

$= \dfrac{66 \times 4^2 \times 11^{10}}{15^{12}}$

$= 0.211$

b) Expected number of sixes = mean

$E(X) = \bar{x} = np = 12 \times \dfrac{4}{15} = 3.2$

c) Variance of the number of sixes,

$V(X) = \sigma^2 = npq = 12 \times \dfrac{4}{15} \times \dfrac{11}{15} = 2.347$

Ex.4. A random variable is binomially distributed with mean 6 and variance 4.2. Find 𝑃(𝑋 ≤ 6)

Ans: Since X has a binomial distribution,

Mean $= np = 6$

Variance $= npq = 4.2$

$6 \times q = 4.2$

$q = \dfrac{4.2}{6} = 0.7$

Also,

$p = 1 - q = 1 - 0.7 = 0.3$

This gives,

$n \times 0.3 = 6$

$n = 20$

Now,

$P(r \leq 6) = \sum_{r=0}^{6} \dbinom{20}{r} (0.3)^r (0.7)^{20-r}$

$P(X \leq 6) = P(r = 0) + P(r = 1) + P(r = 2) + P(r = 3) + P(r = 4) + P(r = 5) + P(r = 6)$

$= {}^{20}C_0 (0.7)^{20} + {}^{20}C_1 (0.3)^1 (0.7)^{19} + {}^{20}C_2 (0.3)^2 (0.7)^{18} + {}^{20}C_3 (0.3)^3 (0.7)^{17} + {}^{20}C_4 (0.3)^4 (0.7)^{16} + {}^{20}C_5 (0.3)^5 (0.7)^{15} + {}^{20}C_6 (0.3)^6 (0.7)^{14}$

$= 0.6080$
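Summing seven binomial terms by hand is tedious, so a short program is a convenient check. The following sketch recovers n and p from the given mean and variance and then adds up the terms; the function name `binomial_cdf` is an illustrative assumption.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p), obtained by summing the individual terms."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

mean, variance = 6, 4.2

q = variance / mean   # npq / np = q
p = 1 - q
n = round(mean / p)   # np / p = n, rounded to absorb floating-point error

print(n, round(p, 2), round(q, 2))   # 20 0.3 0.7
print(round(binomial_cdf(6, n, p), 4))  # 0.608
```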

Ex.5. Inland Revenue audits 5% of all companies every year. The companies selected for auditing in

any one year are independent of the previous year’s selection.

a) What is the probability that the company ‘Ross Waste Disposal’ will be selected for auditing

exactly twice in the next 5 years?

b) What is the probability that the company will be audited exactly twice in the next 2 years?

c) What is the exact probability that this company will be audited at least once in the next 4

years?

Ans: Here, $p = 0.05$, $q = 1 - p = 0.95$

a) For n = 5, r = 2

$P(X = r) = \dfrac{n!}{r!\,(n-r)!}\, p^r q^{n-r} = \dfrac{5!}{2!\,3!} (0.05)^2 (0.95)^3 = 0.0214$

b) For n = 2, r = 2

$P(X = r) = \dfrac{n!}{r!\,(n-r)!}\, p^r q^{n-r} = \dfrac{2!}{2!\,0!} (0.05)^2 (0.95)^0 = 0.0025$

c) For n = 4, $r \geq 1$

$P(X \geq 1) = 1 - P(X = 0)$

$= 1 - \dfrac{4!}{0!\,4!} (0.05)^0 (0.95)^4 = 1 - 0.8145 = 0.1855$

Ex.6. For a binomial distribution, mean is 6 and S.D. is 2. Find n, p, q.

Ans: Given $\bar{x} = 6$, $\sigma = 2$, so $\sigma^2 = 4$

We know,

$\bar{x} = np \Rightarrow 6 = np$

Now,

$\sigma^2 = npq \Rightarrow 4 = 6 \times q$

$q = \dfrac{2}{3}$

Again,

$p = 1 - q \Rightarrow p = 1 - \dfrac{2}{3}$

$p = \dfrac{1}{3}$

Hence, $\bar{x} = np \Rightarrow 6 = n \times \dfrac{1}{3}$

$n = 18$

Ex.7. Eight coins are tossed at a time 256 times. Number of heads at each throw is recorded and

results are given below. Find the expected frequencies and fit the Binomial distribution.

No. of Heads at a throw 0 1 2 3 4 5 6 7 8

Frequency 2 6 30 52 67 56 32 10 1

Ans: The probability of getting a head in a single throw is

$P(H) = p = \dfrac{1}{2}$

Hence, $P(T) = q = 1 - p = \dfrac{1}{2}$

Given that n = 8 and N = 256.

The expected frequencies are given by the successive terms of the binomial expansion $N(p + q)^n$, i.e. $f(r) = N \dbinom{n}{r} p^r q^{n-r}$.

Hence the table of expected frequencies is,

No. of heads (r)   N × 8Cr × (1/2)^r × (1/2)^(8−r)   Expected frequency f_i
0                  256 × 1 × (1/2)^8                  1
1                  256 × 8 × (1/2)^8                  8
2                  256 × 28 × (1/2)^8                 28
3                  256 × 56 × (1/2)^8                 56
4                  256 × 70 × (1/2)^8                 70
5                  256 × 56 × (1/2)^8                 56
6                  256 × 28 × (1/2)^8                 28
7                  256 × 8 × (1/2)^8                  8
8                  256 × 1 × (1/2)^8                  1
Total                                                 256

The expected frequencies are close to the observed ones, so the binomial distribution fits the data reasonably well.
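The fitting procedure of this example, expected frequency = N × P(r), can be sketched in a few lines. The variable names below are illustrative, and the observed frequencies are copied from the question so that the two columns can be printed side by side.

```python
from math import comb

N, n, p = 256, 8, 0.5                             # repetitions, coins per throw, P(head)
observed = [2, 6, 30, 52, 67, 56, 32, 10, 1]      # data given in the example

# Expected frequency for r heads: N * nCr * p^r * q^(n - r)
expected = [N * comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]

print("heads  observed  expected")
for r, (obs, exp_freq) in enumerate(zip(observed, expected)):
    print(f"{r:5d}  {obs:8d}  {exp_freq:8.0f}")

print(sum(observed), sum(expected))  # 256 256.0
```

Comparing the two columns row by row gives an informal impression of the fit; a formal comparison would need a goodness-of-fit test, which is outside this example.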


2. Poisson’s Probability Distribution:

Poisson’s experiments are those that involve the number of outcomes of a random variable

X which occur per unit time. The Poisson distribution is a very important discrete probability

distribution, which arises in many different contexts in probability and statistics. Poisson’s

distribution is used in place of binomial distribution in following situations:

i) The number of trials n is very large (i.e. n→ ∞)

ii) The probability of success p in one trial is indefinitely small (p→0)

iii) The expectation or mean np = λ is constant

A discrete random variable X is said to follow Poisson's distribution if it assumes only non-negative integer values r and its probability distribution, denoted by X ~ P(λ), is given by:

$P(X = r; \lambda) = \dfrac{\lambda^r}{r!}\, e^{-\lambda}$ for r = 0, 1, 2, ...

$= 0$ otherwise

where λ is the parameter of the Poisson's distribution.

The probability distribution of the Poisson random variable X, representing the number of outcomes occurring in a given time interval or specified region denoted by t, is

$P(X = r; \lambda t) = \dfrac{(\lambda t)^r}{r!}\, e^{-\lambda t}$ for r = 0, 1, 2, ...

$= 0$ otherwise

where λ is the average number of outcomes per unit time, distance, area or volume.

The Poisson distribution occurs in different situations, for example:

1. It gives the probabilities of a given number of phone calls in a certain time interval;

2. It gives the probabilities of a given number of flaws on a length unit of a wire;

3. It gives the probabilities of a specific number of faults on an area unit of a fabric;

4. It gives the probabilities of a specific number of bacteria in a volume unit of a solution;

5. It gives the probabilities of a specific number of accidents on time unit.

Note:

1. For X ~ P(λ),

Expectation = mean = $E(X) = \lambda$

Variance = $V(X) = \lambda$

2. The probability of at most n outcomes is the sum $P(X \leq n; \lambda) = \sum_{r=0}^{n} P(r; \lambda)$


Ex.1. During a laboratory experiment, the average number of radioactive particles passing through

a counter in 1 millisecond is 4. What is the probability that 6 particles enter the counter in a given

millisecond?

Ans: Here the number of outcomes is r = 6 and $\lambda t = 4$

Using Poisson's distribution,

$P(X = r; \lambda t) = \dfrac{(\lambda t)^r}{r!}\, e^{-\lambda t}$

$P(6; 4) = \dfrac{4^6}{6!}\, e^{-4} = 0.1042$

Ex.2. The average number of planes landing at an airport each hour is 10 while the maximum

number it can handle is 15. What is the probability that on a given hour some planes will have to be

put on a holding pattern?

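Ans: Planes are put on a holding pattern when more than 15 land in the hour, so the outcome is $r > 15$ with $\lambda t = 10$

Using the sum of Poisson probabilities,

$P(X \leq n; \lambda) = \sum_{r=0}^{n} P(r; \lambda)$

$P(X > 15; \lambda) = 1 - P(X \leq 15; \lambda) = 1 - \sum_{r=0}^{15} P(r; \lambda)$

$= 1 - [P(r = 0) + P(r = 1) + P(r = 2) + \ldots + P(r = 15)]$

$= 1 - 0.9513$

$= 0.0487$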

Ex.3. The average number of accidents at a level-crossing every year is 5. Calculate the probability

that there are exactly 3 accidents this year.

Ans: Here, r = 3 and $\lambda t = 5$

$P(X = r; \lambda t) = \dfrac{(\lambda t)^r}{r!}\, e^{-\lambda t}$

$P(X = 3; 5) = \dfrac{5^3}{3!}\, e^{-5} = 0.1404$

i.e. there is a probability of about 14% of exactly 3 accidents this year.
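The three Poisson examples above follow the same pattern and can be reproduced with one small function. This is only a sketch: `poisson_pmf` is a hypothetical helper written from the formula in the text, not a library routine.

```python
from math import exp, factorial

def poisson_pmf(r, lam):
    """P(X = r) = lam^r / r! * e^(-lam) for r = 0, 1, 2, ..."""
    return lam**r / factorial(r) * exp(-lam)

particles = poisson_pmf(6, 4)   # Ex.1: 6 particles when the average is 4
accidents = poisson_pmf(3, 5)   # Ex.3: 3 accidents when the average is 5

# Ex.2: more than 15 planes when the average is 10, via the complement.
holding = 1 - sum(poisson_pmf(r, 10) for r in range(16))

print(round(particles, 4))  # 0.1042
print(round(accidents, 4))  # 0.1404
print(round(holding, 4))    # 0.0487
```

Note that `range(16)` covers r = 0 to 15 inclusive, matching the sum in Ex.2.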

Ex.4. Fit a Poisson distribution to the following data, which give the frequency of the number of deaths due to cancer per army corps per annum, for 10 corps over twenty years.

Deaths      0     1    2    3   4   Total
Frequency   109   65   22   3   1   200

Ans: Here, N = 200 and n = 10

$\lambda = \bar{x} = \dfrac{\sum f_i x_i}{\sum f_i} = \dfrac{0 \times 109 + 1 \times 65 + 2 \times 22 + 3 \times 3 + 4 \times 1}{200}$

$\lambda = 0.61$

Therefore, the expected frequency of r deaths under the Poisson distribution is $F(r) = N \times \dfrac{\lambda^r}{r!}\, e^{-\lambda}$

X = r   f = P(X = r)               F = 200 × f   Frequency (approx)
0       e^(-0.61)                  108.67        109
1       e^(-0.61) (0.61)           66.29         66
2       e^(-0.61) (0.61)^2 / 2!    20.22         20
3       e^(-0.61) (0.61)^3 / 3!    4.11          4
4       e^(-0.61) (0.61)^4 / 4!    0.63          1

Total calculated frequency = 200

Comparing the observed and theoretical frequencies, the agreement is remarkably good, so the Poisson distribution fits the data well.

Ex.5. Fit a Poisson distribution to the following data, which give the number of doddens in samples of clover seeds.

No. of doddens (x)       0    1     2     3    4    5    6   7   8
Observed frequency (f)   56   156   132   92   37   22   4   0   1

Ans: Here, N = 500

$\lambda = \bar{x} = \dfrac{\sum f_i x_i}{\sum f_i} = \dfrac{986}{500} = 1.972$

Therefore, the expected frequency of x doddens under the Poisson distribution is

$F(x) = N \times \dfrac{\lambda^x}{x!}\, e^{-\lambda} = 500 \times \dfrac{e^{-1.972} (1.972)^x}{x!}$

The calculation of the theoretical frequencies is shown below:

X       F (theoretical)   F (rounded)   Observed
0       69.6              70            56
1       137.25            137           156
2       135.32            135           132
3       88.95             89            92
4       43.85             44            37
5       17.29             17            22
6       5.68              6             4
7       1.60              2             0
8       0.39              0             1
Total                     500           500

Hence, the observed frequencies are reasonably close to the theoretical ones, which suggests that the Poisson distribution fits the data.


3. Normal Probability Distribution:

Normal distribution, also known as the Gaussian distribution, is the most important continuous probability distribution in the study of statistics. Further, it is the parent distribution for several important continuous distributions. It is used to model events which occur by chance, such as variation of dimensions of mass-produced items during manufacturing, experimental errors, and variability in measurable biological characteristics such as people's height or weight.

It arises as a limiting form of the binomial distribution, with the same values of mean and variance, and is applicable as an approximation when n is sufficiently large (n > 30). It is a two-parameter distribution denoted by $N(\bar{x}, \sigma^2)$, with probability density given by:

$f(r) = \dfrac{1}{\sigma\sqrt{2\pi}} \cdot e^{-\frac{1}{2}\left(\frac{r - \bar{x}}{\sigma}\right)^2}$, $-\infty < r < \infty$

where $\bar{x}$ and σ are the mean and standard deviation of the distribution respectively, and

$z = \dfrac{r - \bar{x}}{\sigma}$ is called the standard normal variate.

Ex.1. Suppose a particular population has $\bar{x} = 4$ and $\sigma = 2$. Find the probability of a randomly selected value being greater than 6.

Ans: The Z value corresponding to r = 6 is

$z = \dfrac{r - \bar{x}}{\sigma} = \dfrac{6 - 4}{2} = 1$

(Z = 1 means that the value r = 6 is 1 standard deviation above the mean.)

From the Z table, $P(Z \leq 1) = 0.8413$, so $P(X > 6) = P(Z > 1) = 1 - 0.8413 = 0.1587$.
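Once the z value is known, the probability is usually read from a Z table. Where a table is not at hand, the same area can be computed from the error function in Python's standard library. The sketch below relies on the standard identity Φ(z) = ½[1 + erf(z/√2)]; the helper name `normal_cdf` is an assumption made for illustration.

```python
from math import erf, sqrt

def normal_cdf(x, mean=0.0, sd=1.0):
    """P(X <= x) for X ~ N(mean, sd^2), via the standard normal variate z."""
    z = (x - mean) / sd
    return 0.5 * (1 + erf(z / sqrt(2)))

# Ex.1: mean 4, standard deviation 2, probability of a value greater than 6.
p_greater = 1 - normal_cdf(6, mean=4, sd=2)
print(round(p_greater, 4))  # 0.1587
```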

The normal distribution is of great importance for the following reasons:

(i) It is often suitable as a probability model for measurements of weight, length, strength, etc.

(ii) Non-normal data can often be transformed to normality.

(iii) The central limit theorem states that when we take a sample of size n from any distribution with mean $\bar{x}$ and variance $\sigma^2$, the sample mean will have a distribution which gets closer and closer to normality as n increases.

(iv) It can be used as an approximation to the binomial or the Poisson distributions when we

have large n or λ respectively (though this is less useful now that computers can be used to

evaluate binomial/Poisson probabilities).

(v) Many standard statistical techniques are based on the normal distribution.

(vi) A variable that follows the standard normal distribution is written $Z \sim N(0, 1)$.

(vii) The standard normal distribution is symmetric about 0.
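Point (iii), the central limit theorem, can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings that are not from the text: samples are drawn from a Uniform(0, 1) population, the sample size is 30, and 2,000 samples are taken. The function name `sample_means` is hypothetical.

```python
import random
import statistics

def sample_means(n, replications, seed=0):
    """Return `replications` means of samples of size n from Uniform(0, 1)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [statistics.fmean(rng.random() for _ in range(n))
            for _ in range(replications)]

means = sample_means(n=30, replications=2000)

# Uniform(0, 1) has mean 0.5 and standard deviation sqrt(1/12).
population_sd = (1 / 12) ** 0.5
expected_sd_of_means = population_sd / 30 ** 0.5  # sigma / sqrt(n)

print("mean of sample means:", round(statistics.fmean(means), 3))
print("S.D. of sample means:", round(statistics.stdev(means), 4))
print("sigma / sqrt(n):     ", round(expected_sd_of_means, 4))
```

Although the parent population is flat rather than bell-shaped, the sample means cluster around 0.5 with a spread close to $\sigma/\sqrt{n}$, which is what the theorem predicts.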

Note:

1. The normal distribution curve (z curve) is “bell-shaped”, with two tails that theoretically never meet the X-axis, and it is symmetric about $X = \bar{x}$. Hence, mean = median = mode = $\bar{x}$.

2. In the standard normal distribution, $\bar{x} = 0$ and $\sigma^2 = 1$; it is denoted by $N(0, 1)$.

3. The area under the Z curve gives the probability,

i.e. $P(-\infty < r < \infty) = 1$.

Since the curve is symmetric about $\bar{x}$, we have

$P(X < \bar{x}) = P(X > \bar{x}) = 0.5$

4. Any normal variable can be converted into the standard normal variate (SNV) Z using the formula

For $X \sim N(\bar{x}, \sigma^2)$, $z = \frac{x-\bar{x}}{\sigma}$ [the Z table is then used to find the probability]

5. The probability of the variate having a value within a certain interval [a, b] is calculated using

$$P(a < X < b) = \int_a^b \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\bar{x}}{\sigma}\right)^2} \, dx$$

Process of finding z value:

1. Draw a diagram and label it with the given values, i.e. the population mean $\bar{x}$, the population S.D. $\sigma$ and the raw score $r$.

2. Shade the area required as per the question.

3. Convert the raw score $r$ to the standard score Z using the formula.

4. Use tables to find the probability, e.g. $P(0 < Z < z)$.

5. Adjust this result to the required probability.

Ex.2. Wool fibre breaking strengths are normally distributed with mean $\bar{x} = 23.56$ newtons and standard deviation $\sigma = 4.55$. What proportion of fibres would have a breaking strength of 14.45 or less?

Ans: Here, $\bar{x} = 23.56$, $\sigma = 4.55$ and $r = 14.45$.

Draw a diagram and label with given values

Convert $r = 14.45$ to a Z value:

$$z = \frac{14.45 - 23.56}{4.55} \approx -2.0$$

That is, the raw score of 14.45 is equivalent to a standard score of -2.0. It is negative because it is on the left-hand side of the curve.

Use tables to find the probability and adjust this result to the required probability:

$$P(r < 14.45) = P(z < -2.0) = 0.5 - P(0 < z < 2) = 0.5 - 0.4772 = 0.0228$$
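The table lookup in Ex.2 can be cross-checked numerically. The following is a minimal sketch using `NormalDist` from Python's standard `statistics` module; the variable names are illustrative and not part of the original example. Because the code does not round z to -2.0 first, its result differs slightly from the table value in the fourth decimal place.

```python
from statistics import NormalDist

mean, sd, r = 23.56, 4.55, 14.45  # values given in Ex.2

# Standard score of the raw value r.
z = (r - mean) / sd

# P(X <= r) for X ~ N(mean, sd^2), i.e. the area to the left of r.
p_below = NormalDist(mu=mean, sigma=sd).cdf(r)

print("z =", round(z, 2))
print("P(X <= 14.45) =", round(p_below, 4))
```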

Inverse process: (to find a value for $r$ corresponding to a given probability)

1. Draw a diagram and label it.

2. Shade the area given as per the question.

3. Use probability tables to find the Z-score.

4. Convert the standard score Z to the raw score $r$ using the inverse formula

$$r = z \times \sigma + \bar{x}$$

Ex.3. Carrots entering a processing factory have an average length of 15.3 cm and a standard deviation of 5.4 cm. If the lengths are approximately normally distributed, what is the maximum length of the lowest 5% of the load (given tabulated $z = 1.645$ at 5%)?

Ans: Here, $\bar{x} = 15.3$, $\sigma = 5.4$ and $r = ?$

Draw a diagram and label it.

Use standard normal tables to find the Z-score corresponding to this area of probability.

Convert the standard score Z to a raw score $r$ using the inverse formula $r = z \times \sigma + \bar{x}$.

Here, the Z value for the lowest 5% is -1.645 from the normal Z table (negative because it is below the mean).

Hence, $r = z \times \sigma + \bar{x} = -1.645 \times 5.4 + 15.3 \approx 6.4$

The maximum length of the lowest 5% of the load is about 6.4 cm.
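The inverse process used in Ex.3 corresponds to evaluating the inverse of the cumulative distribution function. As a hedged sketch, `NormalDist.inv_cdf` from Python's standard `statistics` module can stand in for the Z table; the variable names below are illustrative only.

```python
from statistics import NormalDist

lengths = NormalDist(mu=15.3, sigma=5.4)  # carrot lengths from Ex.3, in cm

# Z-score that leaves 5% of the area to its left (table value: -1.645).
z_5 = NormalDist().inv_cdf(0.05)

# Raw score with 5% of the load below it, equivalent to z * sigma + mean.
r_5 = lengths.inv_cdf(0.05)

print("z for lowest 5%:", round(z_5, 3))
print("maximum length of lowest 5% (cm):", round(r_5, 1))
```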

Ex.4. The finish times for marathon runners during a race are normally distributed with a mean of

195 minutes and a standard deviation of 25 minutes.

a) What is the probability that a runner will complete the marathon within 3 hours?

b) Calculate to the nearest minute, the time by which the first 8% runners have completed the

marathon.

c) What proportion of the runners will complete the marathon between 3 hours and 4 hours?

Ans: Here, $\bar{x} = 195$ and $\sigma = 25$.

a) $r = 180 \Rightarrow z = \frac{180-195}{25} = -0.6$

$P(Z < -0.6) = 0.5 - P(0 < z < 0.6) = 0.5 - 0.2257 = 0.2743$

b) For $p = 0.08$, $Z = -1.41$

$-1.41 = \frac{r-195}{25} \Rightarrow r = -1.41 \times 25 + 195 = 159.75 \approx 160$ min

Hence, the first 8% of runners have completed the marathon within 160 min.

c) $r = 180 \Rightarrow z = \frac{180-195}{25} = -0.6$, so $P(Z < -0.6) = 0.2743$

$r = 240 \Rightarrow z = \frac{240-195}{25} = 1.8$, so $P(Z < 1.8) = 0.9641$

Hence, $P(-0.6 < z < 1.8) = 0.9641 - 0.2743 = 0.6898$

Hence, the proportion of runners taking between 3 hrs and 4 hrs is approximately 69%.
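All three parts of Ex.4 can be checked with one distribution object. This is a sketch under the example's own assumptions (finish times normal with mean 195 min and S.D. 25 min), again using `NormalDist` from Python's standard library; the variable names are illustrative.

```python
from statistics import NormalDist

finish = NormalDist(mu=195, sigma=25)  # finish times in minutes (Ex.4)

# a) Probability of finishing within 3 hours (180 min).
p_within_3h = finish.cdf(180)

# b) Time by which the first 8% of runners have finished.
t_first_8pct = finish.inv_cdf(0.08)

# c) Proportion finishing between 3 hours (180 min) and 4 hours (240 min).
p_3h_to_4h = finish.cdf(240) - finish.cdf(180)

print("a)", round(p_within_3h, 4))
print("b)", round(t_first_8pct), "min")
print("c)", round(p_3h_to_4h, 4))
```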

Ex.5. For the following standard normal variates z find the proportion (area) occupied by them as

measured from zero.

i) z = 1.98

ii) z = -0.5

iii) z = 1.35 to 2.18

iv) z = -1.98 to 0.5

Ans.: From the z table it is seen that (See table in appendix-….)

i) $A = 0.4761$ for $z = 1.98$

ii) $A = 0.1915$ for $z = -0.5$

iii) $A = 0.4115$ for $z = 1.35$

$A = 0.4854$ for $z = 2.18$

$P(1.35 < z < 2.18) = 0.4854 - 0.4115 = 0.0739$

iv) $A = 0.4761$ for $z = -1.98$

$A = 0.1915$ for $z = 0.5$

Since the two values lie on opposite sides of zero, the areas are added:

$P(-1.98 < z < 0.5) = 0.4761 + 0.1915 = 0.6676$

Note: Shaded portion indicates required area.
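The areas in Ex.5 can be reproduced without a printed table. The sketch below assumes only the standard normal distribution; the helper names `area_from_zero` and `area_between` are hypothetical and introduced purely for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def area_from_zero(z):
    """Area under the standard normal curve between 0 and z."""
    return abs(STD_NORMAL.cdf(z) - 0.5)

def area_between(z1, z2):
    """Area under the standard normal curve between z1 and z2."""
    return abs(STD_NORMAL.cdf(z2) - STD_NORMAL.cdf(z1))

print("i)  ", round(area_from_zero(1.98), 4))
print("ii) ", round(area_from_zero(-0.5), 4))
print("iii)", round(area_between(1.35, 2.18), 4))
print("iv) ", round(area_between(-1.98, 0.5), 4))
```

When both limits lie on the same side of zero, `area_between` equals the difference of the two areas measured from zero; when they lie on opposite sides, it equals their sum.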

Exercise

Q.1. Suppose that a pair of fair dice are to be tossed, and let the random variable X denote the sum

of the points. Obtain the probability distribution for X

Q.2. Find the expected value for the density function of a random variable X given by

$$f(x) = \begin{cases} \frac{1}{2}x, & 0 < x < 2 \\ 0, & \text{otherwise} \end{cases}$$

Q.3. Find the variance and standard deviation of the random variable of above question 2.

Q.4. The probability that a driver must stop at any one traffic light coming to Lincoln University is

0.2. There are 15 sets of traffic lights on the journey.

a) What is the probability that a student must stop at exactly 2 of the 15 sets of traffic lights?

b) What is the probability that a student will be stopped at 1 or more of the 15 sets of traffic

lights?

Q.5. The number of typing mistakes made by a secretary has a Poisson distribution. The mistakes

are made independently at an average rate of 1.65 per page. Find the probability that a three-page

letter contains no mistakes.

Q.6. The download time of a resource web page is normally distributed with a mean of 6.5 seconds

and a standard deviation of 2.3 seconds.

a) What proportion of page downloads take less than 5 seconds?

b) What is the probability that the download time will be between 4 and 10 seconds?

c) How many seconds will it take to complete 35% of the download?

Q.7. For a binomial distribution, mean is 5 and S.D. is 16. Find n, p, q.

Q.8. Mean and S.D. of a binomial distribution are 3 and 2. Find n, p, q.

Q.9. For a binomial distribution, mean is 206 and S.D. is 4. Find n, p, q.

Q.10. Fit Poisson’s distribution to following.

Death 0 1 2 3 4

Freq 122 60 15 2 1

3. Sample and sampling techniques

Introduction:

In the first chapter, we have discussed the collection, distribution and analysis of collected

scientific data. The data in biostatistics are generally based on individual observations. Sampling is

very often used in our daily life. For example, while purchasing food grains from a shop we usually

examine a handful from the bag to assess the quality of the commodity. A doctor examines a few drops of blood as a sample and draws conclusions about the blood constitution of the whole body. Thus, most of our investigations are based on samples. In this chapter, let us see the importance of sampling and the various methods of selecting samples from the population.

Population:

In a statistical enquiry, all the items, which fall within the range of enquiry, are known as

Population or Universe. In other words, the population is a set of all possible observations, which

are to be investigated, having at least one property in common. For example, the population of any

country has a common language, literature, geographic origin and genetic heritage which

distinguish them from people of different nationalities. The students studying in a school or college, the books in a library and the houses in a village or town are some examples of populations.

The objects or individuals in the population are called members or elements and the

number of members in the population constitutes population size. Depending on population size,

population can be finite or infinite. If the number of members in the population is finite (countable), it is a finite population, e.g. the students in a college, the workers in a factory or the articles produced by a company on a particular day. If the number of members is infinite, or too large to count in practice, it is treated as an infinite population, e.g. the stars in a galaxy or the people watching television programmes. Statisticians use the word population to refer not only to people but to all items that have been chosen for study.

Census:

Sometimes it is possible and practical to examine and study every person or item in the population. Such a complete enumeration is called a census. A census is the procedure of systematically collecting and recording information about every member of a given population. It provides true measures of the population. For example, if we study the average annual income of the families of a particular area having 1000 families, then we must study the income of all 1000 families, and no family should be left out.

The population census of India is taken every 10 years. The first census was taken in 1871 – 72. The latest census was taken in 2011.

Merits of census:

1. The data is collected from each and every item of the population.

2. The results are more accurate and reliable.

3. Intensive study is possible.

4. The data collected may be used for various surveys, analyses etc.

Demerits of census:

1. It requires a large number of enumerators and it is a costly method.

2. It requires more money, labour, time, energy, etc.

3. It is not possible in some circumstances where the universe is infinite.

Sample:

If the population is infinite or very large, it is impossible to study each and every member. In this case, a small group of members is selected from the population so that all its important properties or characteristics are covered by the members of that group. Such a group is known as a sample. Thus, a sample is a small group of finitely many members selected from a statistical population so that all the important characteristics of the entire population are covered by the members of the group. It is a representative subset of people, events or items from a larger population. To represent a population well, a sample should be randomly collected and adequately large. The members of a sample, which cannot be further subdivided for sampling, are known as sample points, and the number of members in a sample is called the sample size. It is often necessary to use samples for research, because it is impractical to study the whole population. For example, to study the average height of 12-year-old boys in a country, we could not measure all of the 12-year-old boys in that country, but we could measure a sample of 12-year-old boys.

Reasons for selecting a sample:

Sampling is inevitable in the following situations:

1. Complete enumerations are practically impossible when the population is infinite.

2. When the results are required in a short time.

3. When the area of survey is wide.

4. When resources for survey are limited particularly in respect of money and trained persons.

5. When the item or unit is destroyed under investigation.

Sampling frame:

For adopting any sampling procedure, it is essential to have a list identifying each sampling unit by a number. Such a list or map is called a sampling frame. A list of voters, a list of householders, a list of villages in a district and a list of farmers are a few examples of sampling frames.

Sampling Methods:

The method of selecting small groups i.e. samples from the population which represent the

characteristics (like height, weight, colour) of the population is called sampling method. If we want

to draw sound conclusions from our samples, we need to ensure that we choose the samples correctly.

The sampling process involves following stages:

(i) Define the population of statistical analysis.

(ii) Specify a sampling frame, a set of items or events possible to measure.

(iii) Specifying a sampling method for selecting items or events from the frame.

(iv) Determining the sample size.

(v) Implementing the sampling plan.

(vi) Sampling and data collecting.

Merits of Sampling:

There are many advantages of sampling methods over census method. They are as follows:

1. Under sampling a statistical investigation is carried out speedily.

2. It results in reduction of cost, time, energy and labour.

3. Sampling ends up with greater accuracy of results.

4. The size of the sample can be increased or decreased according to the size of the universe,

availability of resources and degree of accuracy desired.

5. It has greater scope.

Types of Sampling Techniques (Methods):

Following are the different types of sampling which are commonly used:

1. Simple Random Sampling

2. Systematic Sampling

3. Stratified Random Sampling

4. Cluster Sampling

5. Quota Sampling

1. Simple Random Sampling:

It is the most popular method for choosing a sample among population for a wide range of

purposes. Simple random sample is a group of individuals selected from a larger population, using

either a random number table or random number generator. Every individual of this sample is

selected randomly and has equal chance (probability) of being selected. The process or technique of

selection of individuals with same probability of being selected is known as simple random

sampling.

Suppose we have a population of N = 10,000 students in a university. Each of them is known as a unit or member. To select a sample of the required size n, say 200, we could use simple random sampling. Students would be selected at random and sent a questionnaire for analysis.

Steps to create Simple Random Sample:

a) Define the population

b) Select the sample size (n)

c) List the population

d) Assign numbers to the units

e) Choose random numbers

f) Select sample of required size

a) Define the population

In the above example, the population consists of the N = 10,000 students of a university. As we are interested in all these students, our sampling frame is all 10,000 students. If we were interested only in male students, female students would be excluded, the population would be defined for males and N would be less than 10,000.

b) Select the sample size

Suppose we want a sample of n = 200 students. The sample size is limited by the resources and time required to distribute the questionnaire to the students.

c) List the population

To draw a sample of 200 students, we have to identify all 10,000 students of the university. To carry out the research, we have to obtain permission from Student Records to view a list of all students studying at the university.

d) Assign numbers to the units

Now assign consecutive numbers from 1 to N (the population size) to the units of the population. In our case, we assign the numbers 1 to 10,000.

e) Find random numbers

Next, make a list of 200 random numbers to select the members of the sample from the full list of 10,000 students. These random numbers can be found using either random number tables or a computer program that generates them.

f) Select your sample

Finally, we select the 200 students corresponding to the 200 random numbers. Suppose the first three random numbers are 0007, 8182 and 0576. This means we have selected the 7th, 8182nd and 576th students from the list of 10,000 students. Continue the process until the sample contains all 200 students.
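Steps d) to f) can be carried out by a computer program instead of a random number table. The sketch below is one possible way to do this with Python's standard `random` module; the function name `simple_random_sample` and the seed value are assumptions made for illustration.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Pick `sample_size` distinct unit numbers from 1..population_size.

    Every unit has the same chance of being chosen.
    """
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(range(1, population_size + 1), sample_size)

# 200 students out of the 10,000 numbered in step d).
chosen = simple_random_sample(10_000, 200, seed=42)
print(len(chosen), "students selected; first three:", chosen[:3])
```

Each number in `chosen` identifies one student on the numbered list, in the same way as the random numbers 0007, 8182 and 0576 in step f).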

a) Simple Random Sampling with Replacement:

In this method, the first element is selected at random from the population. Its characteristics are studied and recorded, then it is placed back into the population and the second element is selected at random. This process is continued until a sample of the required size is selected. Because each selected unit is returned, a unit may be selected more than once, and the population size remains the same at every selection.

b) Simple Random Sampling without Replacement:

In simple random sampling without replacement, first element is selected at random from

population of size N. After studying its characteristics it is not replaced back into the population.

Then second element is selected at random from the remaining population of size N-1 and so on.

Thus the size of population goes on decreasing in each selection. In this method, element once

selected for sample cannot be repeated.
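The difference between the two schemes can be seen in a short sketch. It assumes a toy population of ten numbered units (not from the text) and uses `random.choices` for selection with replacement and `random.sample` for selection without replacement, both from Python's standard library.

```python
import random

rng = random.Random(1)        # seeded only so the run is repeatable
units = list(range(1, 11))    # hypothetical population of N = 10 units

# With replacement: each draw is made from all N units, so repeats can occur.
with_replacement = rng.choices(units, k=5)

# Without replacement: a drawn unit is removed, so no unit can repeat.
without_replacement = rng.sample(units, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```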

Merits of simple random sampling:

1) It provides a sample which is highly representative of the population being studied,

assuming that there is limited missing data.

2) It is a fair way of selecting sample from a population as every member is given equal

chance of being selected.

3) It is easy to make inferences about whole population from the results of the sample,

because of representativeness of a sample obtained from population.

4) It shows more accuracy when the size of both population and sample is very large.

Demerits of simple random sampling:

1) It is expensive and time-consuming.

2) One of the most obvious limitations of simple random sampling is the need of a complete

list of all members of the population.

3) It is not suitable for the sample of very small size. In this case sample is not a true

representative of the population.

4) When there is a large difference between the units of the population, a simple random sample may not be representative.

2. Systematic Random Sampling:

It is also known as Quasi-random sampling. Systematic random sampling is a little bit

different from simple random sampling. This method is frequently used when the population is

homogenous or of the same subgroup and a complete list of the population is available. In this

method all the members of the population are arranged in systematic and definite order. The

complete list of population may be arranged in alphabetical, geographical, or numerical order. The

first unit of the sample is selected at random and the remaining units are then selected in a specific manner. After selecting the first unit, the subsequent elements are selected by taking every k-th member from the list of the population until a sample of the required size is complete, where k is the ratio of the population size (N) to the required sample size (n), i.e.

$$k = \frac{\text{population size}}{\text{sample size}} = \frac{N}{n}$$

Suppose a researcher wants to study about the career goals of students in the Institute

which has near about 8,000 students. Thus, the population size (N) is 8,000. He wants to select a

sample of n = 100 students using systematic sampling. With systematic random sampling, every student would have an equal chance of being selected for the sample.

Steps to create Systematic Random Sample:

a) Define the population

b) Select the sample size (n)

c) List the population and arrange in specific or definite order

d) Calculate value of k

e) Select the first unit

f) Select sample of required size.

a) Define the population:

In the situation described above, the population consists of the N = 8,000 students in the Institute, and we are interested in all of them. The Institute may have both male and female students; if we were interested only in female students, the male students would be excluded and the population would be defined for females only.

b) Select the Sample Size (n):

Decide the number of members of the sample for the study. Suppose we want a sample of n = 100 students. The sample size is limited by the resources and time required to distribute the questionnaire to the students.

c) List the population and arrange in specific or definite order:

To draw a sample of 100 students, we have to identify all 8,000 students of the Institute. Collect the list of all students studying in the Institute and arrange them in a specific order, i.e. either assign numbers from 1 to N or arrange them alphabetically.

d) Calculate value of ‘k’:

Assuming that we have chosen a sample size of 100 students, we need to find the value of k, which is the ratio of the population size to the sample size:

$$k = \frac{\text{population size}}{\text{sample size}} = \frac{N}{n} = \frac{8{,}000}{100} = 80$$

This tells us that we have to choose 1 student in every 80 from the population of 8,000 students of the Institute.

e) Select the first unit:

After finding k, we need to select the first student at random. As we have assigned numbers

to the members of population, choose any student at random from 1 to 80 (k) and suppose it is 25th

student.

f) Select sample of required size:

We have the first member i.e. 25th student of our sample. So we can select the remaining 99

members easily using value k.

Now add 𝑘 = 80 to first member 25 which will give next member.

25 + 80 = 105, so the 105th student is the second member.

Then 105 + 80 = 185, so the 185th student is the third member.

Continue the process until the sample of the required size is complete.
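The selection rule above (a random start between 1 and k, then every k-th unit) can be sketched in a few lines. The function name `systematic_sample` is hypothetical, and the sketch assumes that N is an exact multiple of n, as in the example with N = 8,000 and n = 100.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return unit numbers start, start + k, start + 2k, ... (sample_size of them).

    Assumes population_size is an exact multiple of sample_size.
    """
    k = population_size // sample_size          # sampling interval
    if start is None:
        start = random.Random(seed).randint(1, k)  # random start in 1..k
    return [start + i * k for i in range(sample_size)]

# Reproduce the worked example: first unit is the 25th student.
sample = systematic_sample(8_000, 100, start=25)
print(sample[:3])   # [25, 105, 185]
print(sample[-1])   # 7945
```

Passing `start=None` lets the function draw the first unit at random, which is how the method would be used in practice.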

Merits of Systematic Random Sampling:

1) It is easy to construct, execute, compare and understand.

2) It reduces time and work.

3) It gives accurate results if properly performed.

4) It distributes the sample evenly over the population.

Demerits of Systematic Random Sampling:

1) It may not be possible to select the required sample size if the population is too small or

infinite.

2) Bad arrangement of the units may produce inefficient sample.

3) It may not be the representative of the whole population.

3. Stratified Random Sampling:

This technique is widely used and very useful when the population is heterogeneous with

respect to variables or characteristics. This heterogeneous population is then divided into several

smaller homogeneous groups. These groups are known as strata (singular Stratum). A simple

random sample of suitable size from each stratum is selected to constitute a required sample

known as stratified random sample. Since each stratum is more homogeneous than the original

population, we are able to get more precise estimates of the whole. Stratified random sampling is

also called proportional random sampling or quota random sampling. Generally, it is used in cases

like males vs. females; houses vs. apartments, etc where we are interested in particular strata

(groups) in a population.

For example, geographical regions can be stratified into similar regions by means of some

known variable such as habitat type, elevation or soil type. Another example might be to determine

the proportions of defective products being assembled in a factory. In this case sampling may be

stratified by production lines, factory, etc.

Suppose a researcher wants to study more about the career goals of students at University having

roughly 10,000 students (N) and he is interested in comparing the differences in career goals

between male and female students. Using following steps we will create stratified random sample.

Steps to create Stratified Random Sample:

a) Define the population

b) Select relevant stratification

c) List the population according to selected stratification

d) Select sample size (n)

e) Calculate proportionate stratification

f) Select sample of required size using simple random sampling or systematic random

sampling.

a) Define the population:

Here the population is of 10,000 students at the University which is population size (N).

Our sampling frame is all 10,000 students as we are interested in all these students.

b) Select relevant stratification:

We want to study the differences in male and female students, so gender is the

required stratification. And hence we will use gender male and female as our strata.

c) List the population according to selected stratification:

Within each stratum, assign consecutive numbers from 1 to $N_k$ to the students, where $N_k$ is the number of students in stratum k. This results in two lists, one detailing all male students and one detailing all female students.

d) Select sample size (n):

Decide the number of members for the sample for the further study. Suppose we want to

choose the sample size (n) of 200 students.

e) Calculate proportionate stratification:

Suppose that out of the 10,000 students, 60% (= 6,000) are male and 40% (= 4,000) are female. While selecting the members of our sample, we have to ensure that the number of units from each stratum is proportionate to the number of males and females in the population. To achieve this, we multiply the desired sample size (n = 200) by the proportion of units in each stratum (60% and 40%).

Hence, number of males for the required sample $= 200 \times 60\% = 200 \times \frac{60}{100} = 120$

and number of females for the required sample $= 200 \times 40\% = 200 \times \frac{40}{100} = 80$.

This means that we need to select 120 male students and 80 female students for our sample of 200 students.

f) Select sample of required size:

Finally, we select 120 male students from the 6,000 and 80 female students from the 4,000, using either simple random sampling or systematic random sampling, to make up the sample.
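Steps e) and f) can be sketched as follows. The function names, the numbering of students and the seed are assumptions made for illustration; simple random sampling is used within each stratum. Note that rounding the allocations may not add up exactly to n when the proportions are not as tidy as 60% and 40%.

```python
import random

def proportionate_allocation(stratum_sizes, sample_size):
    """Share the sample among strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in stratum_sizes.items()}

def stratified_sample(strata, sample_size, seed=None):
    """Draw a simple random sample of the allocated size from each stratum."""
    rng = random.Random(seed)
    sizes = {name: len(units) for name, units in strata.items()}
    allocation = proportionate_allocation(sizes, sample_size)
    return {name: rng.sample(strata[name], allocation[name]) for name in strata}

# Hypothetical numbering: students 1..6000 are male, 6001..10000 are female.
strata = {"male": list(range(1, 6001)), "female": list(range(6001, 10001))}

allocation = proportionate_allocation({"male": 6000, "female": 4000}, 200)
print(allocation)   # {'male': 120, 'female': 80}

sample = stratified_sample(strata, 200, seed=0)
print({name: len(units) for name, units in sample.items()})
```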

The principal reasons for using stratified random sampling rather than simple random

sampling include:

1. Stratification may produce a smaller error of estimation than would be produced by a simple

random sample of the same size. This result is particularly true if measurements within strata

are very homogeneous.

2. The cost per observation in the survey may be reduced by stratification of the population

elements into convenient groupings.

3. Estimates of population parameters may be desired for subgroups of the population. These

subgroups should then be identified.

Merits of stratified random sampling:

1) It provides us with a sample that is highly representative of the population being studied,

assuming that there is limited missing data.

2) It allows us to make statistical conclusions from the data collected that will be considered to

be valid.

3) It improves the potential for the units to be more evenly spread over the population.

4) It improves the representation of particular strata (groups) within the population, as well as

ensuring that these strata are not over-represented.

5) Stratification gives a smaller error in estimation and greater precision than the simple

random sampling method.

Demerits of stratified random sampling:

1) It can be used only if the population can be clearly divided into strata; that is, each unit of the population must belong to exactly one stratum.

2) Even if a list is readily available, it may be challenging to gain access to that list. The list may be protected by privacy policies or require a lengthy process to obtain permissions.

3) It may be difficult and time consuming to bring together numerous sub-lists to create a final

list from which you want to select your sample.

4) It can increase costs to carry out the research.

4. Cluster Sampling:

This technique is generally used in case of homogeneous population. It is the sampling

technique in which the population is divided into separate groups known as clusters. A complete

list of clusters represents the sampling frame. Each element of the population can be assigned to

one, and only one, cluster. Then, a few clusters are chosen randomly as the source of primary data.

Elements in the clusters are then sampled together for the required sample. Cluster sampling can

be one-stage or two-stage sampling. This is a popular method in marketing research.
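One-stage cluster sampling, in which a few clusters are chosen at random and every element of the chosen clusters enters the sample, might be sketched as below. The villages, households and function name are hypothetical and serve only to illustrate the idea.

```python
import random

def one_stage_cluster_sample(clusters, n_clusters, seed=None):
    """Choose n_clusters clusters at random and take every unit in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # clusters are the sampling units
    sample = [unit for name in chosen for unit in clusters[name]]
    return chosen, sample

# Hypothetical sampling frame: 10 villages (clusters) of 20 households each.
clusters = {
    f"village_{i}": [f"village_{i}_household_{j}" for j in range(1, 21)]
    for i in range(1, 11)
}

chosen, sample = one_stage_cluster_sample(clusters, n_clusters=3, seed=7)
print("clusters chosen:", chosen)
print("households in sample:", len(sample))   # 3 clusters x 20 households = 60
```

In two-stage sampling, the second step would draw a random sample within each chosen cluster instead of taking every unit.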

Merits of cluster sampling:

1) This technique is cheap, quick and easy. Instead of sampling an entire country, the researcher

can allocate his limited resources to the few randomly selected clusters or areas when using

cluster sample.

2) It reduces variability and increases the levels of efficiency of sampling.

3) This method is easy to be used from practicality viewpoint.

Demerits of cluster sampling:

1) This technique is the least representative of the population as compared to other sampling

techniques.

2) It is a sampling technique with the possibility of high sampling error.

The Difference between Stratified and Cluster sampling:

Although strata and clusters are both non-overlapping subsets of the population, they differ in several ways:

1) In stratified sampling, elements within each stratum are the sampling units, while in cluster sampling a whole cluster is treated as the sampling unit.

2) With stratified sampling, the best survey results occur when elements within strata are internally homogeneous. However, with cluster sampling, the best results occur when elements within clusters are internally heterogeneous.

3) The main difference between cluster sampling and stratified sampling lies in which groups are included. In stratified random sampling, all the strata of the population are sampled, while in cluster sampling the researcher randomly selects only some of the clusters from the collection of clusters of the entire population. Therefore, only those clusters are sampled and all the other clusters are left unrepresented.

Multi-stage sampling

Multi-stage sampling (also known as multi-stage cluster sampling) is a more complex form

of cluster sampling which contains two or more stages in sample selection. In multi-stage sampling

large clusters of population are divided into smaller clusters in several stages in order to make

primary data collection more manageable. It has to be noted that multi-stage sampling is not as

effective as random sampling; however, it addresses certain disadvantages associated with random

sampling such as being overly expensive and time-consuming.

Merits of Multi-stage sampling:

1) It is effective in primary data collection from geographically dispersed population.

2) It is cost-effective and time-effective.

3) This method has high level of flexibility.

Demerits of Multi-stage sampling:

1) This method is not highly representative of whole population.

2) It has high level of subjectivity.

3) Group-level information is required at each stage.

5. Quota sampling:

Quota sampling is a non-probability sampling technique in which representative data are collected from groups. These groups represent certain characteristics of the population chosen by the researcher.

For example, suppose a researcher wants to evaluate the impact of cross-cultural differences on 10,000 students in a University. He also needs to assess the effectiveness of students’ motivational tools, taking into account gender differences within the University.

Steps to create Quota sample:

a) Define population

b) Choose the relevant stratification and divide the population accordingly

c) Calculate quota from each stratum

d) Continue to invite cases until the quota for each stratum is fulfilled

a) Define population:

Here the population is the 10,000 students at the university, so the population size is 𝑁 = 10,000, and we need to select 100 students, so the sample size is 𝑛 = 100.

b) Choose the relevant stratification and divide the population accordingly:

The students in the university, who form the sampling frame, need to be divided into the following groups (strata) according to their cultural background:

i) North Indian

ii) South Indian

iii) East Indian

iv) West Indian

c) Calculate quota from each stratum:

The number of cases to be included from each stratum depends on the make-up of that stratum within the population. For instance, if we want to examine the differences between male and female students, the number of students from each group included in the sample would be based on the proportion of male and female students among the 10,000 university students.

For example, if there were 6,000 male students (60% of the total) and 4,000 female students (40% of the total), our sample would need to be made up of 60% males and 40% females. If our desired sample size was 100 students, the sample should include 60 male students and 40 female students.
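The proportional quota calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `proportional_quotas` and the dictionary layout are assumptions made here, not part of the text.

```python
def proportional_quotas(stratum_sizes, sample_size):
    """Allocate a sample across strata in proportion to stratum size.

    stratum_sizes: dict mapping stratum name -> number of population units.
    sample_size:   total number of cases wanted in the sample.
    """
    total = sum(stratum_sizes.values())
    # Each stratum receives its proportional share, rounded to a whole case.
    return {name: round(sample_size * size / total)
            for name, size in stratum_sizes.items()}


# The 60% / 40% example from the text: 6,000 male and 4,000 female students.
quotas = proportional_quotas({"male": 6000, "female": 4000}, 100)
print(quotas)  # {'male': 60, 'female': 40}
```

With less convenient proportions the rounded quotas may not add up exactly to the sample size, so a final manual adjustment of one stratum may be needed.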

d) Continue to invite cases until the quota for each stratum is fulfilled:

Once we have decided the number of cases needed in each stratum, we simply keep inviting participants to take part in the research until each of these quotas is filled.

Merits of Quota sampling:

1) It is particularly useful when a probability sample cannot be obtained.

2) It is easier and quicker to carry out as it does not require a sampling frame.

3) It improves the representation of particular strata within the population.

Demerits of Quota sampling:

1) The sample is not selected at random, and hence the sampling error cannot be determined.

2) Valid statistical inferences from the sample to the population cannot be made.

Exercise

1. Define population. Explain need of sample in detail.

2. Differentiate census and sample.

3. Explain simple random sampling with replacement and without replacement.

4. Distinguish between: simple random sampling with replacement and without replacement.

5. Write short note on systematic sampling.

6. Distinguish between: stratified sampling and cluster sampling.

7. Explain quota sampling.

8. Differentiate between systematic sampling and stratified sampling.

9. Give advantages and disadvantages of simple random sampling.

10. Explain the advantages of sampling.

4. Correlation

Introduction:

In the previous lesson, we learned about the joint probability distribution of two random variables X and Y. In this lesson, we extend our investigation of the relationship between two random variables by learning how to quantify the extent or degree to which X and Y are associated or correlated.

The idea of correlation is used by ordinary people in day-to-day life, knowingly or unknowingly. For example, when parents advise their children to work hard so that they may get good marks, they are correlating good marks with hard work.

In the previous lessons we studied the characteristics, measures of central tendency and measures of dispersion of a single variable, i.e. univariate data. But there are variables which are related to each other, e.g. the height and weight of persons. Data containing two variables which are related to each other are called bivariate data in statistical analysis. Sometimes the variables may be interrelated, like blood pressure and age. The nature and strength of such relationships may be studied by correlation and regression.

Correlation:

In statistical analysis, two sets of data or two random variables may depend on each other in such a way that an increase or decrease in the values of one variable is accompanied by an increase or decrease in the values of another variable. The extent of the linear relationship between two or more variables is called correlation.

E.g. correlation between the demand for a product and its price.

Correlation is a single number that describes the degree of linear relationship between two variables. It is a statistical technique which shows how strongly pairs of variables are related. Two variables are said to be correlated if a change in one of the variables is accompanied by a change in the other variable.

Uses of correlation:

1. It is used in physical and social sciences.

2. Businessmen estimate costs, sales, price etc. using correlation.

3. It is useful for economists to study the relationship between variables like price, quantity etc.

4. It is helpful in measuring the degree of relationship between variables like income and expenditure, price and supply, supply and demand etc.

5. Sampling error can be calculated.

6. It is the basis for the concept of regression.

Scatter Diagram:

A scatter diagram is a diagrammatic representation of the relationship between two variables. It is the simplest method of studying correlation. In a scatter diagram, one variable is taken along the horizontal axis and the second variable along the vertical axis. Each pair of observations of the two variables is represented by a dot in the plane of the axes, so there are as many dots as there are paired observations. The pattern of the dots (their scatter or concentration) helps to decide the type of correlation.

The following are the types of correlation:

1) Positive Correlation:

If a change in the values of one variable is accompanied by a change in the same direction in the values of another variable, then it is positive correlation. It is a relationship between two variables which move in the same direction: if the values of one variable decrease then the values of the other variable also decrease, and vice versa.

E.g. Price and supply are two variables which are positively correlated. When price increases, supply also increases; when price decreases, supply decreases.

The scatter diagram for positive correlation is shown below. The line corresponding to the scatter plot is an increasing line.

Positive Correlation

2) Negative Correlation:

If a change in the values of one variable is accompanied by a change in the opposite direction in the values of another variable, then it is negative correlation. It is a relationship between two variables which move in opposite directions: if the values of one variable decrease then the values of the other variable increase, and if the values of one variable increase then the values of the other variable decrease.

E.g. Price and demand are two variables which are negatively correlated. When price increases, demand decreases; when price decreases, demand increases.

The scatter diagram for negative correlation is shown below. The line corresponding to the scatter plot is a decreasing line.

Negative Correlation

3) Zero Correlation:

When there is no relationship between two variables, it is zero correlation. An increase or decrease in the values of one variable does not affect the other variable.

E.g. Body weight and intelligence: gaining weight does not make a person smarter. Intelligence is not affected by weight, i.e. there is no relation between these two variables.

The scatter diagram for zero correlation is shown below. Zero correlation occurs when there is no linear dependency between the two variables.

Zero Correlation

Merits of Scatter diagram:

1. It is the simplest and an attractive method of finding the nature of correlation between two variables.

2. It is a non-mathematical method and easy to understand.

3. It is not affected by extreme items.

4. It is the first step in finding out the relation between two variables.

5. We can see at a glance, roughly, whether the correlation is positive or negative.

Demerits of Scatter diagram:

By this method we cannot get the exact degree of correlation between the two variables.

Correlation Coefficient:

The scatter diagram does not give an exact idea of the relationship between two variables. Instead, a number can give a good idea of how closely one variable is related to another. If there is any relationship between two variables, we need to measure the degree of that relationship. This measure is called the correlation coefficient, i.e. the numerical value that expresses, in unit-free terms, the degree to which two variables are related to each other. It gives the strength and direction of a linear relationship.

Covariance:

Before studying the correlation coefficient we start with covariance, which measures the dependence between two random variables, say X and Y.

If X and Y are two random variables (discrete or continuous) with respective means $\bar{x}$ and $\bar{y}$, then the covariance of X and Y, denoted by Cov(X, Y), is defined as:

$$\mathrm{Cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n} = \frac{\sum x_i y_i}{n} - \bar{x}\,\bar{y}$$

where $x_i$ are observations in X and $y_i$ are observations in Y.
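As a quick check that the two forms of the covariance formula agree, the following sketch evaluates both on the father and son heights used in Ex.2 of this chapter. The function name `covariance` and the variable names are illustrative assumptions.

```python
def covariance(xs, ys):
    """Return Cov(X, Y) computed two ways, both with divisor n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Definition: average product of deviations from the means.
    deviation_form = sum((x - mean_x) * (y - mean_y)
                         for x, y in zip(xs, ys)) / n
    # Shortcut: mean of the products minus the product of the means.
    shortcut_form = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    return deviation_form, shortcut_form


father = [64, 65, 66, 67, 68, 69, 70]
son = [66, 67, 65, 68, 70, 68, 72]
deviation_form, shortcut_form = covariance(father, son)
print(round(deviation_form, 4), round(shortcut_form, 4))  # 3.5714 3.5714
```

Both forms give 25/7, matching the column total of 25 for dx·dy in that example.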

Note:

1. The value of correlation coefficient lies in between -1 and +1.

2. If correlation coefficient = 1 then it is perfectly positive correlation.

3. If correlation coefficient = -1 then it is perfectly negative correlation.

4. If correlation coefficient = 0 then it is zero correlation i.e. there is no correlation.

5. If correlation coefficient >0 then variables are positively correlated.

6. If correlation coefficient <0 then variables are negatively correlated.

Coefficient of correlation can be measured using two methods:

1) Karl Pearson’s Correlation Coefficient (r)

2) Spearman’s Rank Correlation Coefficient (R)

1) Karl Pearson’s Correlation Coefficient (r):

This is the simplest and most common way to measure the degree of correlation between two variables. It is also known as the product-moment correlation coefficient. It measures both the strength and the direction of a linear relationship between two variables: its magnitude indicates how closely the points lie to a straight line of best fit through the data.

If $x_1, x_2, x_3, \ldots, x_n$ are n observations of variable X and $y_1, y_2, y_3, \ldots, y_n$ are n observations of variable Y, then Karl Pearson's Correlation Coefficient, denoted by r, is defined as

$$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} \quad \text{where } \sigma_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} \text{ and } \sigma_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n}}$$

Thus,

$$r = \frac{n \sum x_i y_i - \sum x_i \cdot \sum y_i}{\sqrt{\left[n \sum x_i^2 - \left(\sum x_i\right)^2\right] \cdot \left[n \sum y_i^2 - \left(\sum y_i\right)^2\right]}}$$

OR

$$r = \frac{n \sum dx \cdot dy - \sum dx \cdot \sum dy}{\sqrt{\left[n \sum dx^2 - \left(\sum dx\right)^2\right] \cdot \left[n \sum dy^2 - \left(\sum dy\right)^2\right]}}$$

where $dx = x_i - \bar{x}$ and $dy = y_i - \bar{y}$.

Steps:

1. Find the means $\bar{x}$, $\bar{y}$ of the two variables X and Y.

2. Take the deviations dx, dy of the two series X and Y using the formula. Then take their squares $dx^2$ and $dy^2$ and prepare a table with the following columns.

X    Y    dx    dy    dx²    dy²    dx·dy

3. Calculate total of each column.

4. Substitute the values in the formula of r and find r.

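Ex.1. Calculate the coefficient of correlation from the 7 pairs of observations, given that $\sum x = 212$, $\sum y = 152$, $\sum x^2 = 6514$, $\sum y^2 = 3390$, $\sum xy = 4681$.

Ans:

$$r = \frac{n \sum x_i y_i - \sum x_i \cdot \sum y_i}{\sqrt{\left[n \sum x_i^2 - \left(\sum x_i\right)^2\right] \cdot \left[n \sum y_i^2 - \left(\sum y_i\right)^2\right]}}$$

$$= \frac{7 \times 4681 - 212 \times 152}{\sqrt{\left[7 \times 6514 - (212)^2\right] \cdot \left[7 \times 3390 - (152)^2\right]}}$$

$$= \frac{32767 - 32224}{\sqrt{(45598 - 44944) \cdot (23730 - 23104)}} = \frac{543}{\sqrt{654 \times 626}} = \frac{543}{639.8468}$$

r = 0.8486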
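The arithmetic of Ex.1 can be checked with a short Python sketch that works directly from the given sums. The function name `pearson_from_sums` and its argument names are assumptions introduced here for illustration.

```python
import math


def pearson_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson's r from the raw sums, using the first computational formula."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *
                            (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Sums given in Ex.1 (7 pairs of observations).
r = pearson_from_sums(n=7, sum_x=212, sum_y=152,
                      sum_x2=6514, sum_y2=3390, sum_xy=4681)
print(round(r, 4))  # 0.8486
```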

Ex.2. Find Karl Pearson’s coefficient of correlation from the following data between height of father

(x) and son (y).

X 64 65 66 67 68 69 70

Y 66 67 65 68 70 68 72

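Ans:

$$\bar{x} = \frac{\sum x}{n} = \frac{469}{7} = 67 \quad \& \quad \bar{y} = \frac{\sum y}{n} = \frac{476}{7} = 68$$

X      Y      dx = x − 67    dy = y − 68    dx²    dy²    dx·dy
64     66     -3             -2             9      4      6
65     67     -2             -1             4      1      2
66     65     -1             -3             1      9      3
67     68     0              0              0      0      0
68     70     1              2              1      4      2
69     68     2              0              4      0      0
70     72     3              4              9      16     12
Total  469    476    0    0    28    34    25

$$r = \frac{n \sum dx \cdot dy - \sum dx \cdot \sum dy}{\sqrt{\left[n \sum dx^2 - \left(\sum dx\right)^2\right] \cdot \left[n \sum dy^2 - \left(\sum dy\right)^2\right]}}$$

$$= \frac{7 \times 25 - 0 \cdot 0}{\sqrt{\left[7 \times 28 - (0)^2\right] \cdot \left[7 \times 34 - (0)^2\right]}} = \frac{175}{\sqrt{196 \times 238}} = \frac{175}{215.9814}$$

r = 0.810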
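The tabular steps of Ex.2 (means, deviations, squares, products, totals) translate directly into code. The sketch below follows those steps; the function name `pearson_r` is an illustrative assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means (the dx, dy table method)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]          # column dx
    dy = [y - mean_y for y in ys]          # column dy
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    # Deviations about the true means sum to zero, so the formula reduces to
    # sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2)).
    return sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)


father = [64, 65, 66, 67, 68, 69, 70]
son = [66, 67, 65, 68, 70, 68, 72]
r = pearson_r(father, son)
print(round(r, 3))  # 0.81
```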

Ex.3. Calculate the correlation coefficient for the following heights of fathers (x) and their sons (y).

x 65 66 67 67 68 69 70 72

y 67 68 65 68 72 72 69 71

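Ans:

$$\bar{x} = \frac{\sum x}{n} = \frac{544}{8} = 68 \quad \& \quad \bar{y} = \frac{\sum y}{n} = \frac{552}{8} = 69$$

X      Y      dx = x − 68    dy = y − 69    dx²    dy²    dx·dy
65     67     -3             -2             9      4      6
66     68     -2             -1             4      1      2
67     65     -1             -4             1      16     4
67     68     -1             -1             1      1      1
68     72     0              3              0      9      0
69     72     1              3              1      9      3
70     69     2              0              4      0      0
72     71     4              2              16     4      8
Total  544    552    0    0    36    44    24

$$r = \frac{n \sum dx \cdot dy - \sum dx \cdot \sum dy}{\sqrt{\left[n \sum dx^2 - \left(\sum dx\right)^2\right] \cdot \left[n \sum dy^2 - \left(\sum dy\right)^2\right]}}$$

$$= \frac{8 \times 24 - 0 \cdot 0}{\sqrt{\left[8 \times 36 - (0)^2\right] \cdot \left[8 \times 44 - (0)^2\right]}} = \frac{192}{\sqrt{288 \times 352}} = \frac{192}{318.3959}$$

r = 0.6030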

Merits of correlation coefficient:

This method not only indicates the presence or absence of correlation between any two

variables but also determines the exact extent or degree to which they are correlated.

It is easy to identify type of correlation between the two variables i.e. positive or negative.

It helps to estimate the value of a dependent variable with reference to a particular value of

an independent variable through regression equations.

Demerits of correlation coefficient:

It is strongly affected by the values of extreme items.

In comparison with other methods, it takes more time to arrive at the result and is tedious to calculate.

It assumes a linear relationship between the variables even though one may not exist.

It is liable to be misinterpreted, as a high degree of correlation does not necessarily mean a very close relationship between the variables.

2) Spearman’s Rank Correlation Coefficient (R):

This non-parametric method is used to determine the degree of correlation when one or both of the variables are qualitative in nature. In some cases the ranks of the observations are already given; if they are not, the observer must assign ranks.

If $x_1, x_2, x_3, \ldots, x_n$ are n observations of variable X and $y_1, y_2, y_3, \ldots, y_n$ are n observations of variable Y, then Spearman's Rank Correlation Coefficient, denoted by R, is defined as

$$R = 1 - \frac{6 \sum D^2}{n(n^2 - 1)} \quad \text{where } D = R_x - R_y$$

$R_x$ = Ranks of data X

$R_y$ = Ranks of data Y

Ex.1. Calculate Rank correlation coefficient from following data.

Marks by Judge A 81 72 60 33 29 11 56 42

Marks by Judge B 75 56 42 15 30 20 60 80

Ans:

X      Y      Rx    Ry    D = Rx − Ry    D²
81     75     1     2     -1             1
72     56     2     4     -2             4
60     42     3     5     -2             4
33     15     6     8     -2             4
29     30     7     6     1              1
11     20     8     7     1              1
56     60     4     3     1              1
42     80     5     1     4              16
Total                                    32

$$R = 1 - \frac{6 \sum D^2}{n(n^2 - 1)} = 1 - \frac{6 \times 32}{8 \times 63} = 1 - 0.3810 = 0.6190$$
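The ranking and the rank-correlation formula in this example can be reproduced with the following sketch. It assumes, as in the worked table, that rank 1 goes to the highest mark and that there are no ties; the helper names `ranks_descending` and `spearman_rank` are illustrative.

```python
def ranks_descending(values):
    """Assign rank 1 to the largest value. Assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rank(xs, ys):
    """Return (R, sum of D^2) using R = 1 - 6*sum(D^2) / (n*(n^2 - 1))."""
    n = len(xs)
    rx, ry = ranks_descending(xs), ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2


judge_a = [81, 72, 60, 33, 29, 11, 56, 42]
judge_b = [75, 56, 42, 15, 30, 20, 60, 80]
rho, sum_d2 = spearman_rank(judge_a, judge_b)
print(sum_d2, round(rho, 3))  # 32 0.619
```

This simple formula is only valid without tied ranks; tied observations would need average ranks and a correction, which this sketch does not attempt.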

Ex.2. Psychological tests of intelligence and arithmetical ability were applied to 10 candidates.

Results are given in table. Compute rank correlation coefficient between X and Y.

Intelligence ratio (X) 90 95 115 96 85 110 89 98 97 93

Arithmetical ratio (Y) 95 90 110 100 85 105 94 106 111 93

Ans.:

X Y 𝑅𝑋 𝑅𝑦 𝐷2 = (𝑅𝑋 − 𝑅𝑦)2

90 95 8 6 4

95 90 6 9 9

115 110 1 2 1

96 100 5 5 0

85 85 10 10 0

110 105 2 4 4

89 94 9 7 4

98 106 3 3 0

97 111 4 1 9

93 93 7 8 1

Total 32

$$R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 32}{10 \times 99} = 1 - 0.194 = 0.806$$

Ex.3. In a dance competition, two judges rank 10 participants in the following order. From the given data calculate the coefficient of rank correlation.

Ranking by judge M 6 4 3 1 7 8 9 10 5 2

Ranking by judge N 4 1 6 7 8 7 10 3 2 5

Ans.:

Rank by M (Rx)    Rank by N (Ry)    D² = (Rx − Ry)²
6                 4                 4
4                 1                 9
3                 6                 9
1                 7                 36
7                 8                 1
8                 7                 1
9                 10                1
10                3                 49
5                 2                 9
2                 5                 9
Total                               128

$$R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 128}{10 \times 99} = 1 - 0.776 = 0.224$$

Merits of Rank Correlation Coefficient:

Spearman's rank method is the only way of studying correlation between qualitative data which cannot be measured in figures but can be arranged in serial order.

Demerits of Rank Correlation

1) The method cannot be used for two-way frequency tables or bivariate frequency distributions.

2) It can be conveniently used only when n is small (say, up to 30); otherwise the calculations become tedious.

Exercise

1. Define correlation. Explain different methods of studying correlation.

2. What is Spearman’s rank correlation? When can it be used?

3. Om electrical obtained 120 tube lights from two companies and tested their life in hours.

The following results were obtained. Calculate the coefficient of variation and find which

company’s tubes are more durable?

Life of tubes (hrs) Company A Company B

800-1000 12 15

1000-1200 20 22

1200-1400 38 40

1400-1600 12 13

1600-1800 15 28

1800-2000 03 02

4. From given data of height of father and daughter in centimeters, calculate the correlation

coefficient.

Father 165 168 160 163 170 175 173

Daughter 160 175 166 159 173 180 177

5. Following table shows ages (X) in years and blood pressure (Y). From given data calculate

correlation coefficient.

X 25 50 60 43 51 74 46 33 49 58

Y 120 135 140 115 130 133 126 139 125 136

6. In epidemiological study of glaucoma in urban and rural population following data was

made available by WHO. Find if there is any correlation between urban and rural area.

No. of cases per 1000

Urban 23 35 28 36 45 39 19

Rural 20 30 22 40 35 45 22

7. Examine correlation for given data containing erythrocytes sedimentation rate in mm/hr of

10 male and female.

Male 112 65 70 82 105 75 60

Female 85 100 90 63 78 105 90

8. From data given in table find out the value of Karl Pearson’s Coefficient of correlation.

Fertilizer used 15 19 22 27 35 40 50

Productivity 80 95 102 118 135 144 150

9. Calculate the coefficient of correlation for given data of marks obtained by students in

Pharmaceutics and Pharmacology.

Pharmaceutics 75 51 42 77 62 81 60 58 66 49

Pharmacology 69 48 64 45 71 42 64 70 40 65

10. Compute coefficient of correlation from following data of supply and price of goods.

Supply 182 160 152 169 158 166 179

Price 167 198 152 170 162 152 180

11. Table contains values of import of raw material and export of finished formulation in

suitable unit. Calculate coefficient of correlation.

Import 15 21 15 16 25 19 12 25 21 10

Export 12 16 14 14 22 17 10 23 19 09

12. In a vocal music contest, two judges rank 09 competitors in following order.

Judge A 5 10 8 9 7 5 4 6 7 3

Judge B 3 9 9 6 10 8 7 8 3 6

13. Calculate Karl Pearson’s coefficient of correlation between x and y. State its kind.

X 39 65 62 90 82 75 25 98 36 78

Y 47 53 58 86 62 68 60 91 51 84

14. Calculate correlation coefficient for the following data.

X 1 3 4 8 9 11 14

Y 1 2 4 5 7 8 9

15. Given the following values of x and y. Find the correlation coefficient.

X 3 5 6 8 9 11

Y 2 3 4 6 5 8

16. Calculate correlation coefficient for the following data.

X 12 9 8 10 11 13 7

Y 14 8 6 9 11 12 3

5. Regression

Introduction:

Having studied the relationship between two variables, in this chapter we estimate the values of one variable when the values of another variable are given. The variable which is to be estimated is called the “dependent” variable and the other is the “independent” variable. Thus, the term regression is used when we want to predict the value of a variable based on the value of another variable. Correlation gives the extent of the linear relationship between two variables, while regression analysis gives a measure of the average relationship between two or more variables in terms of the original units of the data.

The statistical technique of determining unknown values of one variable from the known values of another variable is called regression analysis. The relationships between rainfall and agricultural production, or between consumer expenditure and disposable income, are examples that can be studied by regression. Regression analysis is also used to define and characterize dose-response relationships, to fit linear portions of pharmacokinetic data and to obtain the best fit to linear physical-chemical relationships.

Difference between correlation and regression:

Correlation is related to regression, but its application and interpretation are different. Let us see the exact difference between correlation and regression, as both describe the linear relationship between two or more variables.

Regression:

1. It predicts the values of the dependent variable based on the known values of the independent variable, assuming an average relationship between two or more variables.

2. It involves at least one independent variable which is under the researcher's control.

3. It gives a method to describe the nature of the relationship.

4. The regression coefficient predicts the value of y from the value of x, or vice versa.

Correlation:

1. It gives the association or intensity of relationship between two variables (x and y).

2. There is no concept of dependent or independent variable.

3. It simply describes the strength and direction of the relationship.

4. The correlation coefficient gives an idea of the relationship between two or more variables.

Types of Regression:

Regression analysis can be classified into following types:

1. Simple Regression: If only two variables are studied at a time, it is called simple regression, i.e. there is only one independent variable.

2. Multiple Regression: If more than two variables are studied at a time, it is called multiple regression, i.e. there are two or more independent variables.

3. Linear Regression: If the graphical representation of the given data shows a straight-line pattern, it is linear regression.

4. Non-linear Regression: If the graphical representation of the given data shows a curved pattern, it is non-linear or curvilinear regression.

In this chapter, we are going to study Simple linear regression and its equations.

Simple Linear Regression:

Simple linear regression uses only one independent variable and examines the linear

relationship between two continuous variables: dependent (y) and independent (x) using straight

line. When the two variables are related, it is possible to predict a response value from a predictor

value with better than chance accuracy.

Regression provides the line that "best" fits the data. This line can then be used to:

Examine how the response variable changes as the predictor variable changes.

Predict the value of a dependent variable (y) for independent variable (x).

Regression Lines:

In regression analysis of two variables, the regression line is a smooth curve fitted to the set of paired data of x and y; if the curve is a straight line, it is a line of linear regression. There are as many regression lines as there are variables. In simple linear regression we take two variables X and Y, so there are only two regression lines:

Regression line of Y on X: This gives the most probable values of Y for given values of X.

Regression line of X on Y: This gives the most probable values of X for given values of Y.

Properties of Regression Lines:

(i) For perfect correlation, i.e. 𝑟 = ±1, the two lines coincide with each other, so there is only one straight line.

(ii) If 𝑟 = 0 then the two variables are uncorrelated and the two lines cut each other at right angles.

(iii) If the regression lines are close to each other, there is a high degree of correlation.

(iv) If the regression lines are far away from each other, there is a low degree of correlation.

(v) The two regression lines intersect each other at the point $(\bar{x}, \bar{y})$, i.e. at the means of X and Y.

Linear Regression Equation:

A regression analysis generates two equations which describe the statistical linear relationship between two given variables x and y:

𝑦 = 𝑎 + 𝑏𝑥 ...... (i)

or

𝑥 = 𝑐 + 𝑑𝑦 ...... (ii)

Such algebraic expressions of the regression lines, which show the linear relationship between two variables in the form of straight lines, are called regression equations.

From equation (i) we can estimate unknown values of Y from known values of X; it is known as the regression equation of Y on X.

From equation (ii) we can estimate unknown values of X from known values of Y; it is known as the regression equation of X on Y.

Methods of Linear Regression Analysis:

The methods of regression analysis can be classified as follows.

Regression Methods:

Graphic: Scatter Diagram

Algebraic: Least Square Method

(i) Scatter Diagram:

In this method the points of the two variables X and Y are plotted on graph paper. If $x_1, x_2, \ldots, x_n$ are n observations of variable X and $y_1, y_2, \ldots, y_n$ are n observations of variable Y, then we plot the pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ in a diagram. A regression line is then drawn with a scale or free hand so that it passes through or close to as many points as possible. If the errors in the estimation of variable Y are minimised, we get the regression line of Y on X, and vice versa.

(ii) Least Square Method:

The most common method for fitting a regression line is the method of least squares, because a free-hand line on a scatter diagram is not unique: several different lines can be drawn through the given points. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies exactly on the fitted line, its vertical deviation is 0). Because the deviations are first squared and then summed, there are no cancellations between positive and negative values. A line fitted by the method of least squares is known as the line of best fit.
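The least-squares criterion described above can be written out explicitly for the line $y = a + bx$. The symbol $S$ for the sum of squared vertical deviations is introduced here only for illustration.

$$S(a, b) = \sum_{i=1}^{n} \left(y_i - a - b x_i\right)^2$$

Setting the partial derivatives of $S$ with respect to $a$ and $b$ equal to zero gives the two normal equations

$$\sum y_i = n a + b \sum x_i, \qquad \sum x_i y_i = a \sum x_i + b \sum x_i^2$$

and solving them yields

$$b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}, \qquad a = \bar{y} - b\,\bar{x}$$

so the fitted line always passes through the point of means $(\bar{x}, \bar{y})$.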

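Suppose $x_1, x_2, \ldots, x_n$ are n observations of variable X and $y_1, y_2, \ldots, y_n$ are n observations of variable Y. Then

(i) Equation of regression line Y on X is given by

$$y - \bar{y} = b_{yx}(x - \bar{x})$$

where $\bar{y} = \frac{\sum y_i}{n}$, $\bar{x} = \frac{\sum x_i}{n}$ and $b_{yx}$ = regression coefficient of Y on X:

$$b_{yx} = \frac{n \sum dx \cdot dy - \sum dx \cdot \sum dy}{n \sum dx^2 - \left(\sum dx\right)^2}, \qquad dx = x_i - \bar{x}, \quad dy = y_i - \bar{y}$$

(ii) Equation of regression line X on Y is given by

$$x - \bar{x} = b_{xy}(y - \bar{y})$$

where $\bar{y} = \frac{\sum y_i}{n}$, $\bar{x} = \frac{\sum x_i}{n}$ and $b_{xy}$ = regression coefficient of X on Y:

$$b_{xy} = \frac{n \sum dx \cdot dy - \sum dx \cdot \sum dy}{n \sum dy^2 - \left(\sum dy\right)^2}, \qquad dx = x_i - \bar{x}, \quad dy = y_i - \bar{y}$$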

Another Form of Regression coefficient:

If $\sigma_x$ = standard deviation of variable X,

$\sigma_y$ = standard deviation of variable Y,

$r$ = correlation coefficient between X and Y,

then

(i) Regression coefficient of Y on X is given by

$$b_{yx} = r \cdot \frac{\sigma_y}{\sigma_x}$$

and

(ii) Regression coefficient of X on Y is given by

$$b_{xy} = r \cdot \frac{\sigma_x}{\sigma_y}$$

Properties of Regression Coefficient:

(1) The algebraic signs of both regression coefficients must be same i.e. either positive (+) or

negative (-).

(2) The geometric mean of the two regression coefficients is equal to the correlation coefficient, i.e.

$$r = \pm\sqrt{b_{xy} \cdot b_{yx}} \quad \text{or} \quad r^2 = b_{xy} \cdot b_{yx}$$

(3) The correlation coefficient will have same sign as that of the regression coefficients.

(4) If value of one regression coefficient is greater than one then value of other regression

coefficient must be less than one.

(5) The regression coefficients are independent of origin but not of scale.

(6) If regression line Y on X is of the form

𝑦 = 𝑎 + 𝑏𝑥

Then 𝑏 = 𝑏𝑦𝑥 i.e. regression coefficient of Y on X

(7) The two regression lines intersect each other at point (𝑥 ,𝑦 ) i.e. means of X and Y.

(8) If regression line X on Y is of the form

𝑥 = 𝑐 + 𝑑𝑦

Then 𝑑 = 𝑏𝑥𝑦 i.e. regression coefficient of X on Y

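(9) The angle θ between the two regression lines is given by

$$\tan\theta = \frac{m_1 - m_2}{1 + m_1 m_2}$$

where $m_1$ and $m_2$ are the gradients of the regression lines,

$$m_1 = \frac{\sigma_y}{r \cdot \sigma_x}, \qquad m_2 = \frac{r \cdot \sigma_y}{\sigma_x}$$

Thus,

$$\theta = \tan^{-1}\left[\frac{1 - r^2}{r} \cdot \frac{\sigma_x \cdot \sigma_y}{\sigma_x^2 + \sigma_y^2}\right]$$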

(10) The angle between regression lines indicates the degree of dependence between the variables.
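The step from the gradients to the closed form of the angle in property (9) is not shown in the text; one way to fill it in, using only the symbols already defined, is sketched below.

$$m_1 - m_2 = \frac{\sigma_y}{r\,\sigma_x} - \frac{r\,\sigma_y}{\sigma_x} = \frac{\sigma_y (1 - r^2)}{r\,\sigma_x}, \qquad 1 + m_1 m_2 = 1 + \frac{\sigma_y^2}{\sigma_x^2} = \frac{\sigma_x^2 + \sigma_y^2}{\sigma_x^2}$$

$$\tan\theta = \frac{m_1 - m_2}{1 + m_1 m_2} = \frac{\sigma_y (1 - r^2)}{r\,\sigma_x} \cdot \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2} = \frac{1 - r^2}{r} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}$$

This agrees with the properties of regression lines: when $r = \pm 1$ the numerator vanishes, so $\theta = 0$ and the lines coincide, and as $r \to 0$ the tangent grows without bound, so $\theta \to 90^\circ$.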

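Ex.1. If the values of the two regression coefficients are 0.75 and 0.2, find the correlation coefficient.

Ans: Let $b_{xy} = 0.75$ and $b_{yx} = 0.2$.

$$r = \pm\sqrt{b_{xy} \cdot b_{yx}} = \pm\sqrt{0.75 \times 0.2} = \pm\sqrt{0.15} = \pm 0.3873$$

Since both regression coefficients are positive, r = +0.3873.

Ex.2. Find r, if $b_{xy} = 0.8$ and $b_{yx} = 0.46$.

Ans: Let $b_{xy} = 0.8$ and $b_{yx} = 0.46$.

$$r = \pm\sqrt{b_{xy} \cdot b_{yx}} = \pm\sqrt{0.8 \times 0.46} = \pm\sqrt{0.368} = \pm 0.6066$$

Since both regression coefficients are positive, r = +0.6066.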

Ex.3. From data given in table, calculate two lines of regression.

X 16 20 17 21 15

Y 50 60 58 60 55

i) Estimate value of Y when 𝑋 = 25

ii) Estimate value of X when 𝑌 = 50

Ans.: First find the means:

$$\bar{x} = \frac{\sum x}{n} = \frac{89}{5} = 17.8 \cong 18, \qquad \bar{y} = \frac{\sum y}{n} = \frac{283}{5} = 56.6 \cong 57$$

Prepare a table from the given data, taking the deviations from the rounded means 18 and 57:

x      y      dx = x − 18    dy = y − 57    dx·dy    dx²    dy²
16     50     -2             -7             14       4      49
20     60     2              3              6        4      9
17     58     -1             1              -1       1      1
21     60     3              3              9        9      9
15     55     -3             -2             6        9      4
Total  Σx = 89    Σy = 283    Σdx = −1    Σdy = −2    Σdx·dy = 34    Σdx² = 27    Σdy² = 72

(i) Equation of regression line Y on X is given by

$$y - \bar{y} = b_{yx}(x - \bar{x})$$

where the regression coefficient of Y on X is

$$b_{yx} = \frac{n \sum dx \cdot dy - \sum dx \cdot \sum dy}{n \sum dx^2 - \left(\sum dx\right)^2} = \frac{5 \times 34 - (-1 \times -2)}{5 \times 27 - (-1)^2} = \frac{170 - 2}{135 - 1} = \frac{168}{134} \approx 1.25$$

Hence the equation becomes

𝑦 − 56.6 = 1.25(𝑥 − 17.8)

𝑦 − 56.6 = 1.25𝑥 − 22.25

𝑦 − 1.25𝑥 = 34.35

(ii) Equation of regression line X on Y is given by

$$x - \bar{x} = b_{xy}(y - \bar{y})$$

where the regression coefficient of X on Y is

$$b_{xy} = \frac{n \sum dx \cdot dy - \sum dx \cdot \sum dy}{n \sum dy^2 - \left(\sum dy\right)^2} = \frac{5 \times 34 - (-1 \times -2)}{5 \times 72 - (-2)^2} = \frac{170 - 2}{360 - 4} = \frac{168}{356} = 0.4719$$

Hence the equation becomes

𝑥 − 17.8 = 0.4719(𝑦 − 56.6)

𝑥 − 17.8 = 0.4719𝑦 − 26.71

𝑥 − 0.4719𝑦 = −8.91

i) To estimate the value of y when 𝑥 = 25, use the equation of regression line Y on X:

𝑦 − 1.25(25) = 34.35

𝑦 − 31.25 = 34.35

𝑦 = 65.6

ii) To estimate the value of x when 𝑦 = 50, use the equation of regression line X on Y:

𝑥 − 0.4719(50) = −8.91

𝑥 − 23.60 = −8.91

𝑥 = 14.69
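The whole of this example can be reproduced in code as a cross-check. The sketch below computes both regression coefficients from deviations about the exact means (17.8 and 56.6) rather than the rounded ones; the coefficients are the same either way, because they do not depend on the origin used for the deviations. The function name `regression_lines` is an illustrative assumption.

```python
def regression_lines(xs, ys):
    """Return (mean_x, mean_y, b_yx, b_xy) for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    b_yx = s_xy / s_xx   # slope for predicting y from x
    b_xy = s_xy / s_yy   # slope for predicting x from y
    return mean_x, mean_y, b_yx, b_xy


xs = [16, 20, 17, 21, 15]
ys = [50, 60, 58, 60, 55]
mean_x, mean_y, b_yx, b_xy = regression_lines(xs, ys)

# Both lines pass through (mean_x, mean_y).
y_at_25 = mean_y + b_yx * (25 - mean_x)   # estimate of Y when X = 25
x_at_50 = mean_x + b_xy * (50 - mean_y)   # estimate of X when Y = 50
print(round(b_yx, 4), round(b_xy, 4))     # 1.2537 0.4719
print(round(y_at_25, 2), round(x_at_50, 2))
```

Small differences from the hand calculation in the second decimal place come only from rounding the coefficients before substituting.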

Ex.3. Find the line of regression Y on X and line X on Y if

X Y

A.M. 36 85

S.D. 11 8

r 0.66

Ans.: Given $\bar{x} = 36$, $\bar{y} = 85$, $\sigma_x = 11$, $\sigma_y = 8$, $r = 0.66$.

Now, the regression coefficient of Y on X is given by

$$b_{yx} = r \cdot \frac{\sigma_y}{\sigma_x} = 0.66 \times \frac{8}{11} = 0.48$$

and the regression coefficient of X on Y is given by

$$b_{xy} = r \cdot \frac{\sigma_x}{\sigma_y} = 0.66 \times \frac{11}{8} = 0.9075 \approx 0.91$$

Hence, the equation of regression line Y on X is

$$y - \bar{y} = b_{yx}(x - \bar{x})$$

𝑦 − 85 = 0.48(𝑥 − 36)

𝑦 − 85 = 0.48𝑥 − 17.28

𝑦 − 0.48𝑥 = 67.72

The equation of regression line X on Y is

$$x - \bar{x} = b_{xy}(y - \bar{y})$$

𝑥 − 36 = 0.91(𝑦 − 85)

𝑥 − 36 = 0.91𝑦 − 77.35

𝑥 − 0.91𝑦 = −41.35
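When only the means, standard deviations and r are known, both lines follow from $b_{yx} = r\sigma_y/\sigma_x$, $b_{xy} = r\sigma_x/\sigma_y$ and the fact that each line passes through $(\bar{x}, \bar{y})$. The sketch below applies this to the figures of the example; the function name `lines_from_summary` and the intercept names `a` and `c` (as in $y = a + bx$ and $x = c + dy$) are assumptions for illustration.

```python
def lines_from_summary(mean_x, mean_y, sd_x, sd_y, r):
    """Return ((a, b_yx), (c, b_xy)) for y = a + b_yx*x and x = c + b_xy*y."""
    b_yx = r * sd_y / sd_x
    b_xy = r * sd_x / sd_y
    # Each line passes through the point of means, which fixes the intercept.
    a = mean_y - b_yx * mean_x
    c = mean_x - b_xy * mean_y
    return (a, b_yx), (c, b_xy)


(a, b_yx), (c, b_xy) = lines_from_summary(mean_x=36, mean_y=85,
                                          sd_x=11, sd_y=8, r=0.66)
print(f"Y on X: y = {a:.2f} + {b_yx:.2f} x")
print(f"X on Y: x = {c:.2f} + {b_xy:.4f} y")
```

Here the slope of X on Y is kept at its unrounded value 0.9075, so the intercept of that line comes out near −41.14 rather than the −41.35 obtained after rounding the slope to 0.91.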

Note:

Suppose we are given the equations of two regression lines and it is not mentioned which one is the regression equation of Y on X and which is that of X on Y.

In such a case, assume that the first equation is Y on X and the second is X on Y, and then calculate the regression coefficients $b_{yx}$ and $b_{xy}$.

If these two values have the same sign and satisfy the property of regression coefficients,

$$r^2 = b_{xy} \cdot b_{yx} \le 1$$

then our assumption is correct.

Otherwise, interchange the roles of the two equations.

Ex.4. Given the two regression lines:

8𝑥 – 10𝑦 + 66 = 0

40𝑥 – 18𝑦 – 214 = 0

Find average of x and y as well as correlation coefficient between x and y.

Ans.:

Part A:

Solve two equations simultaneously to find average of X and Y.

8𝑥 – 10𝑦 = − 66 ……….. (1)

40𝑥 – 18𝑦 = 214 …..….. (2)

Multiply equation (1) by 5 and subtract equation (2) from the result.

40𝑥 – 50𝑦 = − 330

- (40𝑥 – 18𝑦 = 214)

− 32𝑦 = −544

𝑦 = 17

Now, substitute 𝑦 = 17 in equation 1.

8𝑥 – 10 (17) = −66

8𝑥 – 170 = −66

8𝑥 = 104

𝑥 = 13

Therefore; 𝑥 = 13 𝑎𝑛𝑑 𝑦 = 17

Part B: to find correlation coefficient, we have two lines of regression but which line is

regression line Y on X and vice-versa is not known.

Let’s assume that

8𝑥 – 10𝑦 = − 66 is regression line Y on X

40𝑥 – 18𝑦 = 214 is regression line X on Y

Express regression line Y on X in the form of $y = a + bx$

$$y = \frac{8}{10}x + \frac{66}{10}$$

Therefore, $b_{yx} = \frac{8}{10} = 0.8$

Express regression line X on Y in the form of $x = c + dy$.

$$x = \frac{9}{20}y + \frac{107}{20}$$

Therefore, $b_{xy} = \frac{9}{20} = 0.45$

Now,

Check whether stated assumption is correct or not.

Check 1: Signs: both regression coefficients are positive.

Check 2: Product of two regression coefficients

$$b_{yx} \cdot b_{xy} = \frac{8}{10} \times \frac{9}{20} = \frac{72}{200} = 0.36 < 1$$

Hence, our assumption is correct.

Now, $r = \pm\sqrt{b_{xy} \cdot b_{yx}} = \pm\sqrt{0.36} = \pm 0.6$

But the sign of correlation coefficient is same as sign of both regression coefficients. So, $r = +0.6$
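The procedure of Ex.4 (solve the two lines for the means, assume which line is which, check the product of the slopes, swap if needed) can be sketched as follows. The function name and the `(a, b, c)` representation of a line $ax + by + c = 0$ are assumptions made for this illustration.

```python
import math


def analyse_regression_pair(line_yx, line_xy):
    """Each line is (a, b, c) meaning a*x + b*y + c = 0.
    line_yx is assumed to be Y on X and line_xy to be X on Y;
    the roles are swapped if the slope product exceeds 1."""
    a1, b1, c1 = line_yx
    a2, b2, c2 = line_xy

    # The means are the intersection point of the two lines (Cramer's rule).
    det = a1 * b2 - a2 * b1
    x_bar = (b1 * c2 - b2 * c1) / det
    y_bar = (a2 * c1 - a1 * c2) / det

    b_yx = -a1 / b1  # slope when the first line is solved for y
    b_xy = -b2 / a2  # slope when the second line is solved for x
    product = b_yx * b_xy
    if product < 0:
        raise ValueError("regression coefficients must have the same sign")
    if product > 1:
        # Assumption was wrong: interchange the two equations.
        return analyse_regression_pair(line_xy, line_yx)

    r = math.copysign(math.sqrt(product), b_yx)  # r takes the sign of the slopes
    return {"x_bar": x_bar, "y_bar": y_bar, "b_yx": b_yx, "b_xy": b_xy, "r": r}


# Ex.4: 8x - 10y + 66 = 0 and 40x - 18y - 214 = 0
result = analyse_regression_pair((8, -10, 66), (40, -18, -214))
print(result)
```

Passing the two equations in the opposite order leads to the same answer, because the swapped assumption gives a slope product greater than 1 and triggers the interchange.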

Exercise

1. What is regression analysis? Explain the concepts of regression.

2. Comment on the properties of regression coefficient and lines.

3. Write a note on different methods to find regression coefficient.

4. For a certain data of two variables, the regression equations are 6𝑥 + 𝑦 – 31 = 0 𝑎𝑛𝑑 3𝑥 +

2𝑦 – 26 = 0. Find the means of x and y as well as coefficient of correlation r.

5. From the given regression equations, calculate the coefficient of correlation and $\sigma_y^2$, where $\sigma_x^2 = 0.9$.

Regression equations: $8x - 10y + 66 = 0$ and $40x - 18y - 214 = 0$.

6. From the given data, find the probable yield when the rainfall is 30''. Find the regression equations when r between rainfall and production = 0.8.

Parameters          Rainfall    Production (units/acre)
Mean ($\bar{x}$)    25''        40
S.D. ($\sigma$)     3''         6

7. Following table contain aptitude test index and productivity indices of 10 workers selected

randomly. Calculate the two regression equations and estimate the productivity index of a

worker whose aptitude score is 92.

Aptitude index 60 62 65 70 72 48 53 73 65 82

Productivity index 68 60 62 80 65 40 52 62 60 81

8. From following data find the two regression lines.

X 7 6 10 14 13

Y 22 18 20 26 24

9. Calculate two lines of regression.

X 7 6 10 14 13

Y 22 18 20 26 24

10. Following data gives values of X and Y.

X 16 12 18 14 12 10 15 12

Y 87 88 89 86 87 80 85 83

(i) Calculate two lines of regression.

(ii) Find correlation coefficient.

(iii) Estimate y when x = 10.

11. Two lines of regression are given by $x + 2y = 5$ and $2x + 3y = 8$. Calculate the value of $\bar{x}$, $\bar{y}$, $b_{xy}$, $b_{yx}$ and r.

12. Given are two linear regression equations $x - 4y = 5$ and $x - 16y = -64$. Find the value of $\bar{x}$, $\bar{y}$, $b_{xy}$, $b_{yx}$ and correlation coefficient between X and Y.

6. Sampling Variability, Significance & Statistical inference

Introduction:

Recall that we typically cannot take a census of the entire population of interest, so we take a sample from that population in order to make estimates and draw conclusions about the population.

A quantity computed from the values in a sample is called a statistic. Values such as the mean $\bar{x}$, the standard deviation or the proportion of individuals in a sample are statistics, and they vary from sample to sample of a population. This variability is called sampling variability.

For example, the average age at which a child learned to walk for one sample of 10 children would be different from the average age to walk for a different sample of 10 children.

Sampling Distribution:

The common statistics used for a sample of any population are the sample proportion ($\hat{p}$) and the sample mean ($\bar{x}$), which are random variables as they vary from sample to sample. This results in a distribution of the statistic, called the sampling distribution. A sampling distribution is the probability distribution of a statistic obtained from a large number of samples drawn from a specific population. It is the distribution of the values the statistic takes over all possible samples of a fixed size.

For example, to find the sampling distribution of GPAT scores for all graduate students in a given year, take repeated random samples of graduate students from the general population of students and then compute the average test score for each sample. The distribution of those sample means would provide the sampling distribution for the average GPAT score.

The variability of a sampling distribution is measured by its variance or standard deviation.

Sampling Error:

Most of the time, the value of a statistic calculated from a sample is assigned to the population from which the sample was drawn. But, in general, there is some difference between the value calculated from the sample and the corresponding value of the population. This difference is called sampling error.

Sampling is an analysis performed by selecting a specific number of observations from a larger population. Because only part of the population is observed, the values obtained from the sample generally differ from those that would be obtained from the entire population, and the difference tends to be larger when the sample is not selected properly. Sampling error can be reduced, though not eliminated, by selecting a sufficiently large sample and ensuring that it represents the entire population.

Standard Error of the Mean:

The variability of a sampling distribution is measured by its variance or standard deviation. The standard deviation of the means of samples is a measure of sampling error, known as the standard error or standard error of the mean (SEM). It is a measure of uncertainty.

The standard error of the mean is the standard deviation of the sampling distribution of the sample mean. It measures how accurately a sample mean represents the population mean: sample means deviate from the actual population mean, and the SEM describes the typical size of this deviation. The standard error is inversely proportional to the square root of the sample size, so the larger the sample size, the smaller the standard error.

For example, for an upcoming national election, 2000 voters are chosen at random and

asked if they will vote for candidate A or candidate B. Out of the 2000 voters, 1040 (52%) state that

they will vote for candidate A. The researchers report that candidate A is expected to receive 52%

of the final vote, with a margin of error of 2%.

In this situation, the 2000 voters are a sample from all the actual voters. The sample

proportion of 52% is an estimate of the true proportion who will vote for candidate A in the actual

election. The margin of error of 2% is a quantitative measure of the uncertainty – the possible

difference between the true proportion who will vote for candidate A and the estimate of 52%.

Significance of SEM:

The standard error of the mean (SEM) estimates the variability of sample means, where the samples are selected from the same population, while the standard deviation measures the variability within a single sample. It is used to determine how precisely the mean of the sample estimates the population mean. A smaller SEM indicates a more precise estimate of the population mean. Since a larger sample size results in a smaller SEM, larger samples give more precise estimates.

The standard error of the mean is also used to calculate the confidence interval, which is a range of values likely to include the population mean.

For example, a medical research team tests a new drug to lower cholesterol. They report

that, in a sample of 400 patients, the new drug lowers cholesterol by an average of 20 units

(mg/dL). The 95% confidence interval for the average effect of the drug is that it lowers cholesterol

by 18 to 22 units.

In this situation, the 400 patients are a sample of all patients who may be treated with the

drug. The confidence interval of 18 to 22 is a quantitative measure of the uncertainty – the possible

difference between the true average effect of the drug and the estimate of 20 mg/dL.

a) Standard error of the mean for one sample:

For a sample of size n and standard deviation σ, the SEM is given by,

$$SEM = \frac{\sigma}{\sqrt{n}}$$

where σ = S.D. of sample

b) Standard error of difference between two sample means:

If $\bar{x}_1$ and $\bar{x}_2$ are means of two samples with sizes $n_1$ and $n_2$ along with respective standard deviations $\sigma_1$ and $\sigma_2$, then SEM is given by

$$SEM_{(\bar{x}_1 - \bar{x}_2)} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
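Both standard-error formulas are one-liners in code. The sketch below uses illustrative function names and a hypothetical standard deviation of 10 to show how the SEM shrinks as the sample size grows.

```python
import math


def sem(sd, n):
    """Standard error of the mean for one sample: sd / sqrt(n)."""
    return sd / math.sqrt(n)


def sem_difference(sd1, n1, sd2, n2):
    """Standard error of the difference between two sample means."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


# Hypothetical S.D. of 10: quadrupling n halves the standard error.
for n in (25, 100, 400):
    print(n, sem(10, n))
```

Because the sample size enters through its square root, halving the standard error requires four times as many observations.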

Standard error of the proportion (SEP):

It is similar to the standard deviation used in describing the dispersion of data about mean.

a) Standard error of proportion for one sample:

If samples of the same size n are repeatedly drawn at random from a population and the sample proportion is recorded as $\hat{p}$, then the standard error of proportion is given by,

$$SEP = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

b) Standard error for difference between two sample proportions:

If $p_1$ and $p_2$ are two sample proportions from different or the same populations with the respective sizes $n_1$ and $n_2$, then SEP is given by

$$SEP_{(p_1 - p_2)} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$
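As an illustration, the election poll described earlier (1040 of 2000 voters favouring candidate A) can be run through the one-sample formula. The text does not say how its 2% margin of error was obtained; the sketch assumes the usual 95% multiplier of 1.96, and the function names are illustrative.

```python
import math


def standard_error_proportion(p_hat, n):
    """SEP for one sample proportion."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


def standard_error_difference(p1, n1, p2, n2):
    """SEP for the difference between two sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Poll from the text: 1040 of 2000 voters favour candidate A.
p_hat = 1040 / 2000
se = standard_error_proportion(p_hat, 2000)

Z_95 = 1.96  # assumed multiplier for a 95% confidence level
margin_of_error = Z_95 * se
print(f"SEP = {se:.4f}, margin of error = {margin_of_error:.1%}")
```

Under that assumption the margin of error comes out at roughly two percentage points, in line with the figure quoted for the poll.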

Degree of freedom (df):

The concept of degrees of freedom is central to the principle of estimating statistics of a population from the samples drawn. It represents how many values in a calculation are free to vary. Degrees of freedom are needed to apply the chi-square test, the t-test and the more advanced F-test correctly. These tests are commonly used to compare observed data with the data that would be expected under a specific hypothesis.

Degrees of freedom (df) are broadly defined as the number of observations in the data

that are free to vary when estimating statistical parameters.

The statistical formula to determine degrees of freedom is simple and given by,

𝑑𝑓 = 𝑛 − 1 where, n = sample size.

We also need to calculate the degrees of freedom for the difference between sample means.

When we assume that the population variances are equal or when both sample sizes 𝑛1 and 𝑛2 are

larger than 50 we use the following formula

𝑑𝑓 = 𝑛1 + 𝑛2 − 2

For example, imagine you are a fun loving person who loves to wear hats and you don’t care

about the degree of freedom for wearing different hats. Unfortunately, you have only 7 hats. Yet you

want to wear a different hat everyday of the week.

On the 1st day, you can wear any of 7 hats. On 2nd day, you can choose from remaining 6

hats, and so on. On 6th day, you still have 2 choices. But after 6th day, you have no choice for the hat

that you wear on Day 7. You must wear the last remaining hat.

Thus, you had 7 – 1 = 6 days of “hat” freedom- in which the hat you wore could vary.

Statistical Inference:

Statistical inference is the process through which inferences about a population are made based on certain statistics calculated from a sample of data drawn from that population. It is essential for analyzing data properly: proper data analysis is necessary to interpret research results and to draw appropriate conclusions. In this chapter, three basic statistical concepts are presented: estimation, confidence interval, and P-value. These concepts are applied to the comparison of proportions and means.

Statistical inference is the act of generalising from sample information to the larger population.

There are two common types of statistical inference:

Estimation

Testing of Hypothesis

Estimation:

Statistical estimation is concerned with the method by which population characteristics are

estimated from sample information. The objective of estimation is to approximate the value of a

population parameter on the basis of a sample statistic i.e. when the value of parameter is unknown

then it can be estimated on the basis of a random sample.

For example, the sample mean 𝑥 is used to estimate the population mean 𝜇.

There are two types of estimation:

(1) Point Estimation (likely value for parameter)

(2) Interval Estimation (also called confidence interval for parameter)

(1) Point Estimates:

A point estimate of an unknown population parameter is a single value of a sample statistic. It is usually reported together with its standard error, which is a measure of the uncertainty associated with the estimation process.

For example, the sample mean 𝑥 is point estimate of the population mean 𝜇.

(2) Interval Estimates:

An interval estimate is defined by two numbers between which the value of the population parameter is expected to lie. Here, we try to construct an interval that covers the true population parameter with a specified probability.

For example, $a < \mu < b$ is an interval estimate of the population mean $\mu$. It indicates that the population mean is estimated to be greater than $a$ but less than $b$.

Characteristics of Estimators:

The desirability of an estimator is judged by its characteristics. There are three important

criteria:

(i) Unbiasedness:

An unbiased estimator of a population parameter is an estimator whose expected value is

equal to the parameter value.

i.e. an estimator, say $\hat{\theta}$, for population parameter $\theta$ is said to be unbiased if

$$E(\hat{\theta}) = \theta$$

For example, the sample mean $\bar{x}$ is an unbiased estimator of the population mean $\mu$, since

$$E(\bar{x}) = \mu$$

(ii) Consistency:

An estimator is said to be consistent if the difference between the estimator and the target population parameter becomes smaller as we increase the sample size.

i.e. an unbiased estimator, say $\hat{\theta}$, for parameter $\theta$ is consistent if

$$V(\hat{\theta}) \to 0 \text{ as } n \to \infty$$

Note that unbiasedness together with a variance that tends to zero is a sufficient condition for consistency. It is not a necessary one, since some biased estimators are also consistent.

For example, the variance of the sample mean $\bar{x}$ is $\frac{\sigma^2}{n}$, which decreases to zero as we increase the sample size n.

(iii) Efficiency:

If we are given two unbiased estimators for a population parameter then the estimator with

a smaller variance is more efficient.

For example, for a normally distributed population, it can be shown that the sample median

is an unbiased estimator for µ. It can also be shown, however, that the sample median has a greater

variance than that of the sample mean, for the same sample size. Hence $\bar{x}$ is a more efficient

estimator than sample median.

Confidence Interval:

In statistics, a confidence interval is used to describe the amount of uncertainty associated with a sample estimate of a population parameter. It is an interval estimate combined with a probability statement. A confidence interval is an interval within which the true value of the population parameter is expected to lie with a stated level of confidence. The width of this confidence interval depends on the properties of the population and the degree of probability considered.

Here we are going to construct confidence intervals for the population proportion (p) and the population mean ($\mu$), using the sample proportion ($\hat{p}$) and the sample mean ($\bar{x}$) as point estimates, which will be the centre of the confidence interval. The width of the confidence interval depends on two things:

i. Level of confidence

ii. Standard error

Confidence level refers to the percentage of all possible samples that can be expected to include the true population parameter. For example, a 95% confidence level implies that 95% of the confidence intervals constructed from such samples would include the true population parameter. The confidence level determines the multiplier in the confidence interval.

The general form of confidence interval is

Point estimate ± multiplier × (standard error)

a) Confidence Intervals for Population Proportions (p):

It is constructed by taking the sample proportion ($\hat{p}$) as the point estimate, with sample size n and the standard error of the sample proportion. Here $z$ is the multiplier for the chosen level of confidence, found from the z table.

$$\text{confidence interval of } p = \hat{p} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

For example, in a survey, people are asked whether they wear a seatbelt while driving. In a sample of 1356 males, 677 said that they wear a seatbelt while driving.

$$\hat{p} = \frac{677}{1356} = 0.499$$

Let's construct a 95% confidence interval for the population proportion from which the sample proportion was drawn.

To compute the confidence interval we need the z multiplier and the standard error.

From the z table, for 95% confidence level, multiplier z = 1.96

Hence, $SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.499(1-0.499)}{1356}} = 0.0136$

Thus, confidence interval $= 0.499 \pm 1.96(0.0136) = 0.499 \pm 0.027 = [0.472, 0.526]$

i.e. We are 95% confident that the population proportion of males who wear a seat belt while driving is between 0.472 and 0.526.
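The seatbelt interval can be reproduced in a few lines. The function name is illustrative, and the multiplier 1.96 is the z-table value quoted in the example.

```python
import math


def proportion_confidence_interval(successes, n, z=1.96):
    """Return (lower, upper, standard_error) for a population proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se
    return p_hat - margin, p_hat + margin, se


# Seatbelt survey: 677 of 1356 males.
low, high, se = proportion_confidence_interval(677, 1356)
print(f"SE = {se:.4f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

The standard error here is on the order of 0.01, which is what makes a margin of about ±0.027 possible with a multiplier of 1.96.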

b) Confidence Intervals for Population Mean (𝝁):

It is constructed by taking the sample mean ($\bar{x}$) as the point estimate, with sample size n and the standard error of the sample mean. Here t is the multiplier for the chosen level of confidence, found from the t table.

$$\text{confidence interval for } \mu = \bar{x} \pm t\frac{\sigma}{\sqrt{n}}$$

For example, in a class survey, students are asked how many hours they sleep per night. In a

sample of 22 students, the mean was 5.77 hours with a standard deviation of 1.572 hours.

Let's construct a 95% confidence interval for the mean number of hours slept per night in the population from which the sample was drawn.

To compute the confidence interval we need the t multiplier and the standard error.

From the t table, with degrees of freedom 22 – 1 = 21 and for 95% confidence level, multiplier t = 2.08

Hence, $SE = \frac{\sigma}{\sqrt{n}} = \frac{1.572}{\sqrt{22}} = 0.335$

Thus, confidence interval $= 5.77 \pm 2.08(0.335) = 5.77 \pm 0.697 = [5.073, 6.467]$

i.e. We are 95% confident that the population mean number of hours students sleep per night is between 5.073 hours and 6.467 hours.
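The same calculation for the sleep survey, as a sketch. The t multiplier is passed in as an argument because it has to be read from a t table for the relevant degrees of freedom; the value 2.08 for 21 degrees of freedom is the one given in the example. The function name is illustrative.

```python
import math


def mean_confidence_interval(mean, sd, n, t_multiplier):
    """Return (lower, upper) as mean ± t * sd / sqrt(n)."""
    se = sd / math.sqrt(n)
    margin = t_multiplier * se
    return mean - margin, mean + margin


# Sleep survey: n = 22, mean 5.77 h, S.D. 1.572 h, t = 2.08 for df = 21.
low, high = mean_confidence_interval(5.77, 1.572, 22, t_multiplier=2.08)
print(f"95% CI = [{low:.3f}, {high:.3f}] hours")
```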

P - value:

The P-value is the probability, computed under the assumption that the null hypothesis is true, of obtaining a result at least as extreme as the one actually observed. It measures the strength of the evidence against the null hypothesis. In hypothesis testing, the p-value helps to determine the significance of the result. The p-value is a number between 0 and 1 and is interpreted in the following way:

A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.

A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.

A p-value very close to the cut-off (0.05) is considered marginal, i.e. the decision could go either way.
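For a test statistic that follows the standard normal distribution, a two-sided p-value can be computed with Python's standard library. This is a minimal sketch: the function names are illustrative, and the two-sided form is an assumption that matches a "not equal" alternative hypothesis.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as |z|."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(z, alpha=0.05):
    """Apply the interpretation rule: reject H0 when p <= alpha."""
    p = two_sided_p_value(z)
    return "reject H0" if p <= alpha else "fail to reject H0"


for z in (1.0, 1.96, 2.5):
    print(z, round(two_sided_p_value(z), 4), decide(z))
```

The familiar critical value 1.96 is the point at which this p-value is approximately 0.05.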

7. Testing of Hypothesis

Introduction:

In this chapter, we study the second type of statistical inference, hypothesis testing, using various statistical methods and probability distributions; the first type was estimation with confidence intervals. The main purpose of this study is to make decisions and draw inferences about a population using samples of that population. In pharmaceutical studies, the purpose is often to demonstrate that “a new drug is effective, or possibly to show that it is more effective than the existing drug”, while in clinical trials the purpose is to demonstrate that “the new drug is better than a placebo control”. In this chapter, we will focus only on numeric outcomes. This chapter also introduces the critical (or rejection) region approach to hypothesis testing and compares it to the p-value approach.

A statistical measure such as mean, standard deviation or variance which describes

population is known as parameter.

Estimation:

The statistical estimation is one of the main objectives or methods of statistics in which

conclusions about a population are drawn and/or decisions are taken from the analysis of the

sample drawn from that population. Statistical inference includes:

1. Estimation theory

2. Tests of hypothesis

3. Non Parametric tests

4. Sequential analysis

In estimation theory, we estimate the unknown value of the population parameter based on

sample observations i.e. the statistical measure calculated from sample is assigned to the

population of that sample.

E.g. suppose we are given a sample of weights of 100 students in a school. So it is possible to

estimate the average weight of all students in that school using the weights of these 100 students.

But, in general, there is difference between the value calculated from sample and the

corresponding value of population. This difference is called Sampling error.

Tests of Hypothesis:

The sample is assumed to be a small representative part of the total population. But in many experiments, the sample does not fully represent the population from which it is selected, so statistical conclusions and estimates about the population can go wrong. In such cases, certain statements about a population parameter are tested to conclude whether the difference between the sample value and the parametric value is due to chance or not. This whole procedure is called testing of hypothesis. In this chapter, we will test the difference between the sample mean and the population mean, i.e. whether both are equal or not.

A hypothesis, in statistics, is a statement about a population which is supposed to be true

till it is proved to be false. It should be stated before conducting the statistical tests of hypothesis. It

is set up in two ways:

a) Null Hypothesis

b) Alternative Hypothesis

a) Null Hypothesis:

Null hypothesis is a statement which is actually tested for acceptance or rejection. It is stated under the assumption that “there is no significant difference” between the sample result and the population result. We assume that the null hypothesis is true, but in pharmaceutical research we usually wish to prove it false.

Null hypothesis is generally denoted by H0.

b) Alternative Hypothesis:

When the null hypothesis is rejected, another statement, called the alternative hypothesis, is accepted. It is the research hypothesis, which is generally believed to be true by the researcher.

Alternative hypothesis is generally denoted by Ha or H1.

Note: The hypothesis we want to test is “likely” true. So there are two possible outcomes:

Reject H0 and accept Ha because of sufficient evidence in favour of Ha.

Do not reject (“accept”) H0 because of insufficient evidence to support Ha.

Failure to reject H0 does not mean that the null hypothesis is true. It only means that we do not have sufficient evidence to support H1.

Elements necessary for Hypothesis testing:

1. Level of significance:

The level of significance, denoted by $\alpha$, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.

The p-value, in turn, is compared with $\alpha$: if the p-value is less than or equal to $\alpha$, reject the null hypothesis; otherwise do not reject it. Equivalently, if the absolute calculated value of the test statistic does not exceed the tabulated (critical) value at the $\alpha$ level of significance, accept the null hypothesis; otherwise reject the null hypothesis.

2. Region of Acceptance and Rejection:

The region of acceptance is the range of values of the test statistic that leads to acceptance of the null hypothesis, while the set of values for which the null hypothesis is rejected is called the region of rejection. The region of rejection is also known as the Critical Region. The values which separate the critical region from the region of acceptance are called critical values.

3. Power of test:

The power of a statistical test is the probability of rejecting the null hypothesis when it is false. It ranges from 0 to 1. Thus, power is the ability of a test to correctly reject the null hypothesis. Although a hypothesis test can be conducted without calculating its power, calculating the power beforehand helps to ensure that the sample size is large enough for the purpose of the test.
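To make the link between power and sample size concrete, the sketch below computes the power of a two-sided one-sample z-test. The formula is the standard one for a normal test statistic with known standard deviation; the function name and all numbers in the usage example (null mean 4.6, true mean 4.4, S.D. 0.5) are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def z_test_power(mu0, mu1, sd, n, alpha=0.05):
    """Power of a two-sided one-sample z-test of H0: mu = mu0
    when the true mean is mu1 and the S.D. is known."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = (mu1 - mu0) / (sd / n ** 0.5)  # true mean in standard-error units
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Hypothetical planning question: how does power grow with n?
for n in (10, 25, 50, 100):
    print(n, round(z_test_power(4.6, 4.4, 0.5, n), 3))
```

When the true mean equals the hypothesised mean, the function returns the significance level itself, which is the probability of a type I error.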

Types of error:

No hypothesis test is 100% correct, as it is based on probabilities. At the time of testing of

hypothesis, null hypothesis may be accepted or may be rejected. Depending on this acceptance or

rejection there are two types of errors:

a) Type one (I) error

b) Type two (II) error

a) Type I error:

Rejection of the null hypothesis when it is true is called type I error. The probability of committing a type I error is $\alpha$, i.e. the level of significance, which is set before testing the hypothesis. An $\alpha$ of 0.05 means there is a 5% chance of rejecting the null hypothesis when it is in fact true. To decrease the chance of this error, use a lower value of $\alpha$. However, a lower value of $\alpha$ makes the test less likely to detect a true difference if one exists.

b) Type II error:

Accepting the null hypothesis when it is false is called type II error. The probability of making a type II error is $\beta$, and the power of the test is $1 - \beta$. To decrease the chance of this error, ensure that the test has enough power, i.e. the sample size should be large enough to detect a practical difference if one exists.

One Tailed test and Two Tailed test:

In statistical hypothesis testing, we need to judge whether a test is one-tailed or two-tailed so that we can find the critical values in tables such as the Standard Normal z Distribution Table and the t Distribution Table. The z curve and t curve are bell-shaped and symmetric about the vertical axis. Each end of the curve represents a tail, which is the rejection area for the hypothesis.

One –tailed test: A test of a statistical hypothesis, where the region of rejection is on only one side

of the sampling distribution, is called a one-tailed test.

Two-tailed test: A test of a statistical hypothesis, where the region of rejection is on both sides of

the sampling distribution, is called a two-tailed test.

Steps to perform Hypothesis testing:

All hypothesis tests are conducted in the same way.

1) State the hypotheses.

2) Formulate an analysis plan: Choose significance level and test statistic.

3) Analyse sample data: Perform calculations.

4) Interpret result: Compare table value with calculated value.

z-test:

A z-test is a statistical test used to determine whether two population means are different

when the variances are known and the sample size is large. The test statistic is assumed to have

normal distribution and standard deviation should be known for an accurate z-test to be

performed. It is basically used for dealing with problems relating to large samples when 𝑛 ≥ 30.

There are different types of z-test each for different purpose. Some of the popular types are

outlined below:

a) For comparing two proportions:

Let $\hat{p}_1$ and $\hat{p}_2$ be two sample proportions with the respective sample sizes $n_1$ and $n_2$

To test:

$$H_0 : \hat{p}_1 = \hat{p}_2$$

Against:

$$H_1 : \hat{p}_1 \neq \hat{p}_2$$

Test statistic:

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

Where p is the proportion of successes in the two samples combined, given by

$$p = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$$

At $\alpha$% level,

If Cal $|z|$ ≤ Tab $z$, then accept $H_0$.

Otherwise, reject $H_0$.

Ex.1. In random samples of 600 and 1000 men from two cities, 400 and 600 men are found to

be literate. Do the data indicate that the populations are significantly different in the percentage

of literacy?

[Given at 5% level of significance, 𝑧 = 1.96 ]

Ans: Let $\hat{p}_1$ = Proportion of literate men from one city

$\hat{p}_2$ = Proportion of literate men from another city

Here, $\hat{p}_1 = \frac{400}{600}$, $n_1 = 600$ and $\hat{p}_2 = \frac{600}{1000}$, $n_2 = 1000$

To test:

$$H_0 : \hat{p}_1 = \hat{p}_2$$

Against:

$$H_1 : \hat{p}_1 \neq \hat{p}_2$$

Test statistic:

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

Where

$$p = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{600 \times \frac{400}{600} + 1000 \times \frac{600}{1000}}{600 + 1000} = \frac{400 + 600}{1600} = 0.625$$

Hence,

$$z = \frac{0.6667 - 0.6}{\sqrt{0.625 \times 0.375 \times \left(\frac{1}{600} + \frac{1}{1000}\right)}} = \frac{0.0667}{0.025} = 2.67$$

Cal $|z|$ = 2.67

At 5% level, Tab $z$ = 1.96

Cal $|z|$ > Tab $z$,

So reject $H_0$.

Therefore, there is significant difference in the percentage of literacy of two cities.
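The literacy comparison can be checked with a short function. This sketch uses the standard pooled form of the two-proportion z statistic, in which the standard error contains the sum $\frac{1}{n_1} + \frac{1}{n_2}$; the function name is illustrative and the critical value 1.96 is the one given in the question.

```python
import math


def two_proportion_z(successes1, n1, successes2, n2):
    """Return (z, pooled_p) for H0: the two population proportions are equal."""
    p1, p2 = successes1 / n1, successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se, pooled


# Literacy data: 400 of 600 men in one city, 600 of 1000 in the other.
z, pooled = two_proportion_z(400, 600, 600, 1000)

TAB_Z = 1.96  # tabulated value at the 5% level, as given in the question
decision = "reject H0" if abs(z) > TAB_Z else "accept H0"
print(round(pooled, 3), round(z, 2), decision)
```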

b) For one sample mean:

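Let $\bar{x}$ be the sample mean and $\mu$ be population mean with $\sigma$ as standard deviation and $n$ as sample size.

To test:

$$H_0 : \bar{x} = \mu$$

Against:

$$H_1 : \bar{x} \neq \mu$$

Test statistic:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Where,

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$

At $\alpha$% level,

If Cal $|z|$ ≤ Tab $z$, then accept $H_0$.

Otherwise, reject $H_0$.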

Ex.2. The mean plasma potassium level of 50 adult males with a certain disease was found to be

3.356mEq/litre and the S.D. was 0.5mEq/litre. The normal adult value of plasma potassium is

4.6mEq/litre. Based on above data, can it be concluded that the males with diseases have lower

plasma potassium level than normal level? (Given at 5%, 𝑧 = 1.96)

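Ans: Here, $\bar{x} = 3.356$, $\sigma = 0.5$, $n = 50$

To test:

$$H_0 : \mu = 4.6$$

Against:

$$H_1 : \mu \neq 4.6$$

Test statistic:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{3.356 - 4.6}{0.5 / \sqrt{50}} = \frac{-1.244}{0.0707} = -17.59$$

Cal $|z|$ = 17.59

At 5% level, Tab $z$ = 1.96

Cal $|z|$ > Tab $z$,

Hence, reject $H_0$.

Thus, since the sample mean lies below 4.6 mEq/litre, the males with the disease have a lower plasma potassium level than normal males.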
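A minimal sketch of the one-sample z statistic, applied to the plasma potassium data. The function name is illustrative. Keeping the sign of z shows the direction of the difference, while the comparison with the table value uses its absolute value.

```python
import math


def one_sample_z(sample_mean, pop_mean, sd, n):
    """z = (sample mean - hypothesised mean) / (sd / sqrt(n))."""
    return (sample_mean - pop_mean) / (sd / math.sqrt(n))


# Plasma potassium: mean 3.356, S.D. 0.5, n = 50, normal value 4.6 mEq/litre.
z_potassium = one_sample_z(3.356, 4.6, 0.5, 50)

TAB_Z = 1.96  # tabulated value at the 5% level, as given in the question
decision = "reject H0" if abs(z_potassium) > TAB_Z else "accept H0"
print(round(z_potassium, 2), decision)
```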

Ex.2. A machine produces metal plates of thickness 1.5cm with S.D. 0.2cm. A sample of 100 plates

produced by machine has an average thickness of 1.52cm. Is the machine fulfilling the purpose for

which it is designed?

Ans: Here, $\bar{x} = 1.52$, $\sigma = 0.2$, $n = 100$

To test:

$$H_0 : \mu = 1.5$$

Against:

$$H_1 : \mu \neq 1.5$$

Test statistic:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{1.52 - 1.5}{0.2 / \sqrt{100}} = \frac{0.02}{0.02} = 1$$

At 5% level, Tab $z$ = 1.96

Cal $|z|$ ≤ Tab $z$,

Hence, accept $H_0$.

Thus, the machine is fulfilling its purpose.

Ex.3. Six bottles from a batch of suspension were assayed for paracetamol content by spectrophotometric method. Each 5 ml of suspension contains 500, 503, 509, 515, 502, 507 mg of paracetamol. Test the hypothesis that the average content of paracetamol is 505 mg.

Ans: $\bar{x} = \frac{\sum x_i}{n} = \frac{500+503+509+515+502+507}{6} = 506$

To test:

$$H_0 : \mu = 505$$

Against:

$$H_1 : \mu \neq 505$$

Test statistic:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Where, $\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$

To find S.D.,

$x_i$    $(x_i - \bar{x})$    $(x_i - \bar{x})^2$
500      -6                   36
503      -3                   9
509      3                    9
515      9                    81
502      -4                   16
507      1                    1
Total                         152

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} = \sqrt{\frac{152}{6}} = 5.033$$

Hence,

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{506 - 505}{5.033 / \sqrt{6}} = 0.4866$$

At 5% level, Tab $z$ = 1.96

Cal $|z|$ ≤ Tab $z$,

So accept $H_0$.

Thus, the average paracetamol content may be taken as 505 mg.
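Starting from the raw assay values, the mean, standard deviation and z statistic can be computed with the standard library. The sketch uses `statistics.pstdev`, which divides by n as the formula in the text does. The variable names are illustrative.

```python
import math
import statistics

# Paracetamol content (mg per 5 ml) of the six bottles.
contents_mg = [500, 503, 509, 515, 502, 507]
claimed_mean = 505

sample_mean = statistics.mean(contents_mg)
sd = statistics.pstdev(contents_mg)  # divides by n, matching the text's formula
z = (sample_mean - claimed_mean) / (sd / math.sqrt(len(contents_mg)))

TAB_Z = 1.96  # tabulated value at the 5% level
decision = "reject H0" if abs(z) > TAB_Z else "accept H0"
print(sample_mean, round(sd, 3), round(z, 4), decision)
```

With only six observations the z-test is an approximation; the chapter itself reserves the z-test for large samples (n ≥ 30) and introduces Student's t-test for small ones.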

c) For two sample means:

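Let $\bar{x}_1$ and $\bar{x}_2$ be two sample means with standard deviations $\sigma_1$ and $\sigma_2$, sample sizes $n_1$ and $n_2$ respectively.

To test:

$$H_0 : \bar{x}_1 = \bar{x}_2$$

Against:

$$H_1 : \bar{x}_1 \neq \bar{x}_2$$

Test statistic:

$$z = \frac{\bar{x}_1 - \bar{x}_2}{S.E.}$$

Where, Standard Error is

$$S.E. = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

At $\alpha$% level,

If Cal $|z|$ ≤ Tab $z$, then accept $H_0$.

Otherwise, reject $H_0$.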

Ex.1. Random samples drawn from two places gave the following data relating to the wing length of Anopheles mosquitoes. Test at 5% level that the mean wing length is the same for mosquitoes at the two places.

(Given $z = 1.96$)

          Place 'A'    Place 'B'
Mean      3.60         3.58
S.D.      1.8          1.6
Size      50           50

Ans: Here, $\bar{x}_1 = 3.60$, $\bar{x}_2 = 3.58$

$\sigma_1 = 1.8$, $\sigma_2 = 1.6$

$n_1 = 50$, $n_2 = 50$

To test:

$H_0: \mu_1 = \mu_2$

Against:

$H_1: \mu_1 \neq \mu_2$

Test statistic:

$$z = \frac{\bar{x}_1 - \bar{x}_2}{S.E.}$$

Where,

$$S.E. = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{1.8^2}{50} + \frac{1.6^2}{50}} = \sqrt{0.116} = 0.34$$

Hence,

$$z = \frac{3.60 - 3.58}{0.34} = 0.059$$

At 5% level, 𝑇𝑎𝑏 𝑧 = 1.96

𝐶𝑎𝑙 𝑧 ≤ 𝑇𝑎𝑏 𝑧 ,

So accept 𝐻0 .

Thus, the mean wing length is the same for mosquitoes at two places.
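The two-sample z statistic can be sketched in Python as follows. The function name `two_sample_z` is hypothetical and only the standard `math` module is used.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Return (z, S.E.) where S.E. = sqrt(sd1^2/n1 + sd2^2/n2)."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / se, se

# Ex.1 (wing length of mosquitoes at places A and B)
z, se = two_sample_z(3.60, 1.8, 50, 3.58, 1.6, 50)
accept_h0 = abs(z) <= 1.96     # 5% level, two-tailed
print(round(se, 4), round(z, 4), accept_h0)
```

Because the code does not round the standard error before dividing, its z may differ in the third decimal place from a hand calculation that uses S.E. = 0.34.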

Ex.2. In two groups of infants of 6 months of age the following values were observed:

Group    No. of infants    Mean weight    S.D.
1        100               6.9 kg         1.10 kg
2        169               7.3 kg         0.91 kg

Test whether the mean weights are significantly different at 5% level.

Ans: Here, $\bar{x}_1 = 6.9$, $\bar{x}_2 = 7.3$

$\sigma_1 = 1.10$, $\sigma_2 = 0.91$

$n_1 = 100$, $n_2 = 169$

To test:

$H_0: \mu_1 = \mu_2$

Against:

$H_1: \mu_1 \neq \mu_2$

Test statistic:

$$z = \frac{\bar{x}_1 - \bar{x}_2}{S.E.}$$

Where,

$$S.E. = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{1.10^2}{100} + \frac{0.91^2}{169}} = \sqrt{0.017} = 0.13$$

Hence,

$$z = \frac{6.9 - 7.3}{0.13} = -3.077$$

At 5% level, $Tab\ z = 1.96$

$|Cal\ z| = 3.077 > Tab\ z$,

So reject $H_0$.

Thus, the mean weights of the two groups are significantly different at 5% level.

Student’s t-Test:

The z-test requires one of two conditions: either the population is normally distributed with a known variance, or the sample size is large.

Hypothesis testing becomes more difficult when the sample is small and is drawn from a normal population whose variance is unknown. In such cases, Student's t-test is applied. A t-test compares a sample mean with a population mean, or the means of two samples. It uses the t-statistic, the t-distribution and the degrees of freedom to determine the probability that the observed difference arose by chance.

Types of Student’s t-test:

There are two types of student’s t-test:

1. Paired t-test

2. Unpaired t-test

1. Paired t-test: It is used to compare the means of two populations when the data are paired, for example in "before-after" measurements on the same individuals.

2. Unpaired t-test: It is used to compare the means of two independent groups of data and determines whether or not the data have come from the same population.

a) T-test for small sample size ($n \le 30$) to compare equality of sample mean and population mean:

Let $\bar{x}$ be the mean of a sample of size $n$ with standard deviation $\sigma$, and let $\mu$ be the population mean.

To test:

$H_0: \bar{x} = \mu$ (the sample comes from a population with mean $\mu$)

Against:

$H_1: \bar{x} \neq \mu$

Test statistic:

$$t = \frac{\bar{x} - \mu}{\sigma/\sqrt{n-1}}$$

Where,

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$

and $n - 1$ is the degree of freedom.

At 𝛼% level,

If 𝐶𝑎𝑙 𝑡 ≤ 𝑇𝑎𝑏 𝑡 , then Accept 𝐻0 .

Otherwise, reject 𝐻0 .

Ex.1. A random sample of 20 sachets of powder containing certain drug gives mean API content of

42 mg and S.D. of 6mg. Test the hypothesis that the population mean is 44 mg.

(Given at 5% level and 19 d.f., 𝑇𝑎𝑏 𝑡 = 2.093)

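Ans.: Given $n = 20$, $\bar{x} = 42$ mg, $\sigma = 6$ mg

To test:

$H_0: \mu = 44$ mg

Against:

$H_1: \mu \neq 44$ mg

Test statistic: Two tailed test

$$t = \frac{\bar{x} - \mu}{\sigma/\sqrt{n-1}} = \frac{42 - 44}{6/\sqrt{19}} = \frac{-2}{1.376} = -1.453$$

At 5% level, $Tab\ t = 2.093$

$|Cal\ t| = 1.453 \le Tab\ t$,

Hence accept $H_0$.

Thus, the data are consistent with a population mean of 44 mg.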
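The same calculation can be sketched in Python. The function name `one_sample_t` is illustrative, and the sketch assumes that the reported S.D. was computed with divisor n, so that dividing by the square root of n − 1 gives the standard error.

```python
import math

def one_sample_t(sample_mean, pop_mean, sd, n):
    """t = (sample mean - population mean) / (sd / sqrt(n - 1)).

    Assumption of this sketch: sd was computed with divisor n.
    Returns the statistic together with its degrees of freedom.
    """
    t = (sample_mean - pop_mean) / (sd / math.sqrt(n - 1))
    return t, n - 1

# Ex.1 (API content of sachets): n = 20, mean 42 mg, S.D. 6 mg, claimed mean 44 mg
t, df = one_sample_t(42, 44, 6, 20)
accept_h0 = abs(t) <= 2.093       # table value for 19 d.f. at the 5% level
print(round(t, 3), df, accept_h0)
```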


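b) T-test for large sample size ($n > 30$) to compare equality of two sample means:

Let $\bar{x}_1$ and $\bar{x}_2$ be two sample means of sizes $n_1$ and $n_2$ along with standard deviations $\sigma_1$ and $\sigma_2$ respectively, drawn from populations with means $\mu_1$ and $\mu_2$.

To test:

$H_0: \mu_1 = \mu_2$

Against:

$H_1: \mu_1 \neq \mu_2$

Test statistic: Unpaired t-test

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Where,

$$\sigma = \sqrt{\frac{n_1\sigma_1^2 + n_2\sigma_2^2}{n_1 + n_2 - 2}}$$

and $n_1 + n_2 - 2$ is the degree of freedom.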

At 𝛼% level,

If 𝐶𝑎𝑙 𝑡 ≤ 𝑇𝑎𝑏 𝑡 , then Accept 𝐻0 .

Otherwise, reject 𝐻0 .

Ex.2. A laboratory checked the shelf life of a formulation (generic product) obtained from two different manufacturers. The data are given in the table. Check whether there is any significant difference between the same kind of product from the two manufacturers.

(Given at 5% level and for 25 df, $Tab\ t = 2.06$)

Product     Mean          S.D.    Sample size
Brand X     2000 days     250     12
Brand Y     2230 days     300     15

Ans.: Given $\bar{x}_1 = 2000$, $\sigma_1 = 250$, $n_1 = 12$

$\bar{x}_2 = 2230$, $\sigma_2 = 300$, $n_2 = 15$

To test:

$H_0: \mu_1 = \mu_2$

Against:

$H_1: \mu_1 \neq \mu_2$

Test statistic: Unpaired t-test

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Where,

$$\sigma = \sqrt{\frac{n_1\sigma_1^2 + n_2\sigma_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{12(250)^2 + 15(300)^2}{12 + 15 - 2}} = \sqrt{84000} = 289.827$$

Hence,

$$t = \frac{2000 - 2230}{289.827\sqrt{\frac{1}{12} + \frac{1}{15}}} = \frac{-230}{112.25} = -2.049$$

At 5% level, $Tab\ t = 2.06$

$|Cal\ t| = 2.049 \le Tab\ t$,

So accept $H_0$.

Thus, the two formulations do not differ significantly in average shelf life.
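The pooled standard deviation and the unpaired t statistic can be sketched in Python from the summary figures alone. The function name `unpaired_t` is an illustrative choice; the pooling formula is the one given in this section.

```python
import math

def unpaired_t(mean1, sd1, n1, mean2, sd2, n2):
    """Unpaired t from summary statistics, pooled as in this section:
    sigma = sqrt((n1*sd1^2 + n2*sd2^2) / (n1 + n2 - 2))."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt((n1 * sd1**2 + n2 * sd2**2) / df)
    t = (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
    return t, pooled_sd, df

# Ex.2 (shelf life of Brand X and Brand Y)
t, pooled_sd, df = unpaired_t(2000, 250, 12, 2230, 300, 15)
accept_h0 = abs(t) <= 2.06        # table value for 25 d.f. at the 5% level
print(round(pooled_sd, 3), round(t, 3), df, accept_h0)
```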

c) T-test for before-after condition of certain treatment:

Let $x$ be the 'before' and $y$ be the 'after' observation of a certain treatment on a sample of size $n$, and let $\sigma$ be the standard deviation of the differences $d$.

To test:

$H_0: \bar{x} = \bar{y}$ (the treatment produces no change on average)

Against:

$H_1: \bar{x} \neq \bar{y}$

Test statistic: Paired t-test

$$t = \frac{\bar{d}}{\sigma/\sqrt{n}}$$

Where,

$$\bar{d} = \frac{\sum (x - y)}{n}$$

taken with $x - y$ when $\bar{d} \ge 0$, otherwise take $(y - x)$, and

$$\sigma = \sqrt{\frac{n\sum d^2 - (\sum d)^2}{n(n - 1)}}$$

$n - 1$ is the degree of freedom.

At 𝛼% level,

If 𝐶𝑎𝑙 𝑡 ≤ 𝑇𝑎𝑏 𝑡 , then Accept 𝐻0 .

Otherwise, reject 𝐻0 .

Ex.3. A test of 150 marks was taken before and after training by newly joined candidates in the production department. The table contains the marks obtained by the candidates. Test whether there is any change in the candidates after training.

(Given at 5% level and 4 df, $Tab\ t = 4.6$)

Candidate                          A      B      C      D      E
Marks obtained before training     110    120    123    132    125
Marks obtained after training      120    118    125    136    121

Ans.: Prepare the following table for the before-after condition.

Candidate    x      y      d = (y − x)    d²
A            110    120    10             100
B            120    118    −2             4
C            123    125    2              4
D            132    136    4              16
E            125    121    −4             16
Total        -      -      Σd = 10        Σd² = 140

To test:

$H_0: \bar{x} = \bar{y}$

Against:

$H_1: \bar{x} \neq \bar{y}$

Test statistic: Paired t-test

$$t = \frac{\bar{d}}{\sigma/\sqrt{n}}$$

Where,

$$\bar{d} = \frac{\sum d}{n} = \frac{10}{5} = 2$$

$$\sigma = \sqrt{\frac{n\sum d^2 - (\sum d)^2}{n(n - 1)}} = \sqrt{\frac{5(140) - (10)^2}{5(5 - 1)}} = \sqrt{30} = 5.477$$

Hence,

$$t = \frac{2}{5.477/\sqrt{5}} = \frac{2}{2.449} = 0.816$$

At 5% level, 4 df, $Tab\ t = 4.6$

$Cal\ t \le Tab\ t$,

So, accept $H_0$.

Thus, training has not shown any significant effect on scores.
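A paired t-test on raw before-after data can be sketched in Python as below. `statistics.stdev` uses divisor n − 1, which is algebraically the same as the S.D. of the differences computed from Σd and Σd². The function name `paired_t` is illustrative.

```python
import math
import statistics

def paired_t(before, after):
    """Paired t statistic for the differences d = after - before."""
    d = [a - b for a, b in zip(after, before)]
    n = len(d)
    d_bar = statistics.mean(d)
    sd = statistics.stdev(d)           # divisor n - 1
    return d_bar / (sd / math.sqrt(n)), n - 1

# Ex.3 (marks before and after training)
before = [110, 120, 123, 132, 125]
after = [120, 118, 125, 136, 121]
t, df = paired_t(before, after)
print(round(t, 3), df)
```

The sign of t only reflects the direction chosen for the difference; the decision uses its absolute value.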

Ex.4. The application of fertilizer was tested for the yield of rice grown in 10 plots. Another set of 10 plots of similar size and condition was taken as control. Test the effect of the fertilizer.

(Given for 9 df, $t_{0.05} = 2.10$)

Fertilizer applied (x)        16   14   18   15   13   17   16   15   14   13
Fertilizer not applied (y)    10   12   11   9    13   13   12   14   13   11

Ans.: Prepare the following table.

x       y       d = (x − y)    d²
16      10      6              36
14      12      2              4
18      11      7              49
15      9       6              36
13      13      0              0
17      13      4              16
16      12      4              16
15      14      1              1
14      13      1              1
13      11      2              4
Total   -       Σd = 33        Σd² = 163

To test:

$H_0: \bar{x} = \bar{y}$

Against:

$H_1: \bar{x} \neq \bar{y}$

Test statistic: Paired t-test

$$t = \frac{\bar{d}}{\sigma/\sqrt{n}}$$

Where,

$$\bar{d} = \frac{\sum d}{n} = \frac{33}{10} = 3.3$$

$$\sigma = \sqrt{\frac{n\sum d^2 - (\sum d)^2}{n(n - 1)}} = \sqrt{\frac{10(163) - (33)^2}{10(10 - 1)}} = \sqrt{\frac{541}{90}} = \sqrt{6.011} = 2.452$$

Hence,

$$t = \frac{3.3}{2.452/\sqrt{10}} = \frac{3.3}{0.775} = 4.256$$

At 5% level, 9 df, $Tab\ t = 2.10$

$Cal\ t > Tab\ t$,

So, reject $H_0$.

Thus, the fertilizer has a significant effect on the yield of rice.

F-test:

The F-test is used to compare the equality of two variances by taking their ratio. The larger variance is placed in the numerator so that the test becomes a right-tailed test, which is easier to evaluate.

To calculate F value:

Let $\sigma_1^2$ and $\sigma_2^2$ be the two sample variances with sample sizes $n_1$ and $n_2$ respectively.

To test:

$H_0: \sigma_1^2 = \sigma_2^2$

Against:

$H_1: \sigma_1^2 \neq \sigma_2^2$ or $\sigma_1^2 > \sigma_2^2$ or $\sigma_1^2 < \sigma_2^2$

Test statistic: F-test

$$F = \frac{\sigma_1^2}{\sigma_2^2} \quad \text{for } \sigma_1^2 > \sigma_2^2$$

$$F = \frac{\sigma_2^2}{\sigma_1^2} \quad \text{for } \sigma_1^2 < \sigma_2^2$$

Where,

$$\sigma_1^2 = \frac{\sum (x_i - \bar{x})^2}{n_1 - 1} \quad \text{and} \quad \sigma_2^2 = \frac{\sum (y_i - \bar{y})^2}{n_2 - 1}$$

At 𝛼% level,

If 𝐶𝑎𝑙 𝐹 ≤ 𝑇𝑎𝑏 𝐹 , then Accept 𝐻0 .

Otherwise, reject 𝐻0 .

Ex.1. Two samples are drawn from two populations. From the data given below, test whether the two samples have the same variance at 5% level of significance. (Given at 5%, $F_{8,7} = 3.76$)

                                                         Total
Sample I (x)     60   65   71   74   76   82   85   87         600
Sample II (y)    61   66   67   85   78   63   85   88   91    684

Ans: To test:

$H_0: \sigma_1^2 = \sigma_2^2$

Against:

$H_1: \sigma_1^2 \neq \sigma_2^2$

Here,

$$\bar{x} = \frac{\sum x_i}{n_1} = \frac{600}{8} = 75$$

$$\bar{y} = \frac{\sum y_i}{n_2} = \frac{684}{9} = 76$$

x_i      (x_i − x̄)²    y_i      (y_i − ȳ)²
60       225            61       225
65       100            66       100
71       16             67       81
74       1              85       81
76       1              78       4
82       49             63       169
85       100            85       81
87       144            88       144
                        91       225
Total    636            Total    1110

$$\sigma_1^2 = \frac{\sum (x_i - \bar{x})^2}{n_1 - 1} = \frac{636}{8 - 1} = \frac{636}{7} = 90.86$$

$$\sigma_2^2 = \frac{\sum (y_i - \bar{y})^2}{n_2 - 1} = \frac{1110}{9 - 1} = \frac{1110}{8} = 138.75$$

Here, $\sigma_1^2 < \sigma_2^2$

$$F = \frac{\sigma_2^2}{\sigma_1^2} = \frac{138.75}{90.86} = 1.527$$

At 5% level, $Tab\ F = 3.76$

$Cal\ F = 1.527$

$Cal\ F \le Tab\ F$,

Accept $H_0$.

Therefore there is no significant difference between the variances of the two samples.
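The variance ratio can be sketched in Python with the standard `statistics` module, whose `variance` function uses divisor n − 1 as in the formulas above. The function name `variance_ratio` is illustrative. It returns the larger variance over the smaller one, together with the degrees of freedom in that order.

```python
import statistics

def variance_ratio(sample_a, sample_b):
    """F = larger sample variance / smaller sample variance,
    with (numerator d.f., denominator d.f.)."""
    var_a = statistics.variance(sample_a)     # divisor n - 1
    var_b = statistics.variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, (len(sample_a) - 1, len(sample_b) - 1)
    return var_b / var_a, (len(sample_b) - 1, len(sample_a) - 1)

# Ex.1 data
x = [60, 65, 71, 74, 76, 82, 85, 87]
y = [61, 66, 67, 85, 78, 63, 85, 88, 91]
f, (df_num, df_den) = variance_ratio(x, y)
accept_h0 = f <= 3.76          # table value given in the example
print(round(f, 3), df_num, df_den, accept_h0)
```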

Ex.2. Consider the two methods of measuring particulate matter in water and assume that we want

to find out if one is more precise than the other. Precision is measured by the fact that a more

precise method shall have a lower standard deviation and hence a lower variance. The data

obtained for 10 tests each for the two methods are given below:

Method    Mean    S.D.    Variance
A         10.3    0.6     0.36
B         11      0.7     0.49

(Given for 9 degrees of freedom and 5% level, $Tab\ F = 3.179$)

Ans: To test:

$H_0: \sigma_1^2 = \sigma_2^2$

Against:

$H_1: \sigma_1^2 \neq \sigma_2^2$

Here,

$\sigma_1^2 = 0.36$ and $\sigma_2^2 = 0.49$

$\sigma_1^2 < \sigma_2^2$

$$F = \frac{\sigma_2^2}{\sigma_1^2} = \frac{0.49}{0.36} = 1.3611$$

At 5% level, $Tab\ F = 3.179$

$Cal\ F < Tab\ F$

Accept $H_0$.

Therefore the variances are not significantly different and hence neither method is more precise than the other.


Exercise

1. In a sample of 400 parts manufactured by a factory, the number of defective parts was

found to be 30. The company, however, claimed that only 5% of their product is defective. Is

the claim tenable? (Given at 5%, 𝑧 = 1.96)

2. The mean life of a sample of optical lenses produced by a company is computed to be

1570hrs with S.D. of 120hrs after which it is presumed that the lenses do not maintain

accuracy. The company claims that the average life of the lenses produced by it is 1600hrs.

Using the level of significance of 0.05, can you say the claim is acceptable?

3. In an investigation on Neonatal Blood Pressure in relation to maturity, the following results

were obtained:

Babies 9 days old        Number    Mean S.B.P.    S.D.
1. Normal                54        75             6
2. Neonatal asphyxia     14        69             5

Is the difference in mean S.B.P. between the two groups significant at 5% level? (Given at 5% level, $z = 1.96$)

4. Intelligence test on two groups of boys and girls gave the following results.

Group Mean S.D. n

1. Boys 75 15 150

2. Girls 70 20 150

Is there any significant difference in the mean scores obtained by boys and girls at 5% level?

(Use z-test)

5. For a random sample of 10 persons fed on diet A, the increase in weight in pounds in a

certain period were : 10, 6, 16, 17, 13, 12, 8, 14, 15, 9

For another random sample of 12 persons fed on diet B the same is given as: 7, 13, 22, 15,

12, 14, 18, 8, 21, 23, 10, 17

Test whether the diets A and B are different as regards their effect on increase in weight.

6. Life time of batteries for a random sample of 10 from a large consignment gave the

following data. Can we accept the hypothesis that average life time of battery is 400 hrs?

(Note: use t-test at 5% level for df = 9).

Battery 1 2 3 4 5 6 7 8 9 10

Life (in hrs X 100) 5.6 4.2 4.6 4.1 5.2 3.8 3.9 4.3 4.4 3.9

7. Two laboratories M and N carry out independent estimation of fat in ice-cream made by

same industry. A sample is taken from each batch and their observations are given in table

below. Is there any significant difference between the mean fat content obtained by two labs

M and N?


Batch No.: 1 2 3 4 5 6 7 8 9 10

Lab M 6 6 9 6 7 6 7 8 8 4

Lab N 7 6 7 8 6 5 7 7 9 4

8. Ten incubation periods of polio cases are given below. Discuss the mean by using t-test.

Days: 66 69 70 65 69 71 70 68 63 62.

9. A drug given for treatment to 10 patients suffering from diabetes showed changes in blood pressure as given in the table. Is it reasonable to believe, at 5% level, that the drug has no side effect in the form of a change in B.P.?

125 130 120 140 135 125 120 140 135 125

10. The weights at birth of female children born in a hospital are given below in kg. Is there anything to suggest that the mean weight of the children differs significantly from the population mean of 3 kg?

2.5 3.0 2.5 3.0 3.2 3.5 2.5 3.1 2.9 3.5


8. ANOVA

Introduction:

In the previous chapter, we learned to use the t-test for testing the equality of means of two populations based on data from two independent samples. In 1918, Ronald A. Fisher created the analysis of variance (ANOVA), which extends the t-test and the z-test to more than two groups. In this chapter, we are going to test the equality of means of three or more populations. This comparison of means is based on the partition of the total variation into its component parts; hence the method is called analysis of variance. It has since been used in many research fields.

ANOVA is a statistical tool used to test whether the means of three or more populations are significantly different from each other when the variances are unknown. It checks the impact of one or more factors by comparing the means of samples.

Assumptions to use the ANOVA:

To use the ANOVA test we make the following assumptions:

Each group sample is drawn from a normally distributed population.

All populations have a common variance.

Within each sample, the observations are sampled randomly and independently of each other.

The sample sizes for the groups are equal and greater than 10.

Factor effects are additive.

Types of ANOVA:

There are two types of ANOVA:

1. One-way ANOVA(unidirectional)

2. Two-way ANOVA

1. One-way ANOVA:

The one-way analysis of variance (ANOVA) is generally used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups using the F-distribution. It has only one independent variable affecting the dependent variable, so it is also known as one-factor analysis of variance. The null hypothesis for the test is that all the group means are equal. Therefore, a significant result means that at least two of the means are unequal.

Limitations of the One Way ANOVA

A one-way ANOVA will tell that at least two groups were different from each other, but it is unable to tell which groups were different.


2. Two-way ANOVA:

A two-way ANOVA is an extension of one-way ANOVA in which there are two independent factors affecting the dependent variable. It is mostly used when there is quantitative as well as qualitative data.

In two-way ANOVA, two null hypotheses are tested if one observation is placed in each cell, i.e. the hypotheses would be:

H01: There is no significant difference between columns (column factor).

H02: There is no significant difference between rows (row factor).

ANOVA Table:

This is the table that shows the output of the ANOVA analysis and whether there is a

statistically significant difference between our group means. The tabular arrangement of source,

sum of squares, degree of freedom, mean sum of square (MSS) and F-ratio is called ANOVA table.

The word "source" stands for source of variation. Some authors prefer to use "between" and

"within" instead of "treatments" and "error", respectively.

Steps to perform ANOVA:

Following are the steps to perform ANOVA:

Step 1: setup null hypothesis (𝑯𝟎) for given data.

a) For one way ANOVA there should be only one null hypothesis.

b) For two way ANOVA there should be two null hypotheses.

Step 2: Find column sums i.e. 𝑐1 , 𝑐2 , 𝑐3 …

In case of two way ANOVA along with column sums, find row sums i.e. 𝑟1 , 𝑟2 , 𝑟3 ,…

Note: for large values, simplify the data by subtracting the smallest observation from all observations. This does not change any of the sums of squares.

Step 3: Find grand total (GT):

𝐺𝑇 = 𝑐1 + 𝑐2 + 𝑐3 + … . = 𝑟1 + 𝑟2 + 𝑟3 + … .

Step 4: Find correction factor (C. F.)

$$C.F. = \frac{(GT)^2}{N}$$

Where; N = Total number of observations

Step 5: Find column sum of squares (CSS)

$$CSS = \frac{c_1^2}{n_1} + \frac{c_2^2}{n_2} + \frac{c_3^2}{n_3} + \cdots - C.F.$$

Where $n_1, n_2, n_3, \ldots$ are the number of observations in the respective columns.

Similarly, for two way ANOVA along with CSS, find out row sum of squares (RSS).

$$RSS = \frac{r_1^2}{n_1} + \frac{r_2^2}{n_2} + \frac{r_3^2}{n_3} + \cdots - C.F.$$

Where $n_1, n_2, n_3, \ldots$ are the number of observations in the respective rows.

Step 6: Calculate total sum of squares (TSS):

𝑻𝑺𝑺 = 𝑺𝒖𝒎 𝒐𝒇 𝒔𝒒𝒖𝒂𝒓𝒆𝒔 𝒐𝒇 𝒂𝒍𝒍 𝒐𝒃𝒔𝒆𝒓𝒗𝒂𝒕𝒊𝒐𝒏𝒔 − 𝑪.𝑭.

Step 7: Calculate error sum of squares (ESS)

a) For one way ANOVA:

𝑬𝑺𝑺 = 𝑻𝑺𝑺 − 𝑪𝑺𝑺

b) For two way ANOVA

𝑬𝑺𝑺 = 𝑻𝑺𝑺 − (𝑪𝑺𝑺+ 𝑹𝑺𝑺)

Step 8: Find degree of freedom (df)

a) For one way ANOVA:

df for 𝐶𝑆𝑆 = 𝑐 − 1 (where c = No. of columns )

df for 𝐸𝑆𝑆 = 𝑐 (𝑟 − 1) (Where c and r = No. of columns and rows )

b) For two way ANOVA:

df for 𝐶𝑆𝑆 = 𝑐 − 1

df for 𝑅𝑆𝑆 = 𝑟 – 1

df for 𝐸𝑆𝑆 = (𝑐 − 1) (𝑟 − 1)

Step 9: ANOVA table

a) One way ANOVA table

Source                   Sum of squares    df          Mean sum of squares (MSS)    F ratio
Between columns (CSS)    CSS               c − 1       CSS/(c − 1)                  F = larger MSS / smaller MSS
Error (ESS)              ESS               c(r − 1)    ESS/[c(r − 1)]

b) Two way ANOVA table

Source                   Sum of squares    df                Mean sum of squares (MSS)    F ratio
Between columns (CSS)    CSS value         c − 1             CSS/(c − 1)                  F1 = larger / smaller of the column MSS and the error MSS
Between rows (RSS)       RSS value         r − 1             RSS/(r − 1)                  F2 = larger / smaller of the row MSS and the error MSS
Error (ESS)              ESS value         (c − 1)(r − 1)    ESS/[(c − 1)(r − 1)]

Step 10) Conclusion

a) One way ANOVA:


Ex.1. Use ANOVA and determine whether the machines are significantly different in their mean

speed (Given; at 5% F2,12 = 3.89).

Machines

A1 A2 A3

25 31 24

30 39 30

36 38 28

38 42 25

31 35 38

Ans.:

1) Set up null Hypothesis;

𝐻0 = There is no significant difference in the average speed of three machines.

2) Minimize data by subtracting the smallest observation (i.e. 24) from all and calculate

column sum.

Data become;

A1 A2 A3

1 7 0

6 15 6

12 14 4

14 18 1

7 11 14

𝑐1 = 40 𝑐2 = 65 𝑐3 = 25

3) Calculate Grand Total

$G.T. = c_1 + c_2 + c_3 = 40 + 65 + 25 = 130$

4) Find Correction Factor:

$$C.F. = \frac{(G.T.)^2}{N} = \frac{(130)^2}{15} = 1126.67$$

5) Calculate CSS

$$CSS = \frac{c_1^2}{n_1} + \frac{c_2^2}{n_2} + \frac{c_3^2}{n_3} - C.F.$$

$$= \frac{40^2}{5} + \frac{65^2}{5} + \frac{25^2}{5} - 1126.67$$

= [320 + 845 + 125] – 1126.67

= 1290 – 1126.67

= 163.33

6) Calculate TSS

$TSS = [\text{Sum of squares of all observations}] - C.F.$

Squares of the coded observations:

A1         A2         A3
1          49         0
36         225        36
144        196        16
196        324        1
49         121        196
Σ = 426    Σ = 915    Σ = 249

TSS = [426 + 915 + 249] – 1126.67

= 1590 – 1126.67

= 463.33

7) Find ESS

ESS = TSS – CSS = 463.33 – 163.33 = 300

8) Degree of freedom

For CSS = c − 1 = 3 − 1 = 2

For ESS = c(r − 1) = 3(5 − 1) = 12

9) One Way ANOVA table

Source    Sum of squares    d.f.    MSS                   F ratio
CSS       163.33            2       163.33/2 = 81.665     F = 81.665/25 = 3.267
ESS       300               12      300/12 = 25

10) Given; at 5% level of significance; $Tab\ F_{2,12} = 3.89$

$Cal\ F_{2,12} = 3.267$

$Cal\ F \le Tab\ F$

So, accept $H_0$

Thus, there is no significant difference in the average speed of three machines.

Ex.2. Prepare ANOVA table for the following data and test if the varieties differ significantly among themselves.

(Given; at 5% level of significance F8,3 = 4.07 and F3,8 = 8.83).


Varieties

A B C D

20 25 24 23

29 23 20 20

21 21 22 20

Ans.:

1) Set up null Hypothesis.

𝐻0 = The four varieties do not differ significantly among themselves

2) Minimize data by subtracting the smallest observation (i.e. 20) from all and calculate

column sum.

Data become;

A           B          C          D
0           5          4          3
9           3          0          0
1           1          2          0
c1 = 10     c2 = 9     c3 = 6     c4 = 3

3) Calculate Grand Total

$G.T. = c_1 + c_2 + c_3 + c_4 = 10 + 9 + 6 + 3 = 28$

4) Find Correction Factor:

$$C.F. = \frac{(G.T.)^2}{N} = \frac{(28)^2}{12} = 65.33$$

5) Calculate CSS

$$CSS = \frac{c_1^2}{n_1} + \frac{c_2^2}{n_2} + \frac{c_3^2}{n_3} + \frac{c_4^2}{n_4} - C.F.$$

$$= \frac{10^2}{3} + \frac{9^2}{3} + \frac{6^2}{3} + \frac{3^2}{3} - 65.33$$

= [33.33 + 27 + 12 + 3] – 65.33

= 75.33 – 65.33

= 10

6) Calculate TSS

$TSS = [\text{Sum of squares of all observations}] - C.F.$

Squares of the coded observations:

A          B          C          D
0          25         16         9
81         9          0          0
1          1          4          0
Σ = 82     Σ = 35     Σ = 20     Σ = 9

TSS = [82 + 35 + 20 + 9] – 65.33

= 146 – 65.33

= 80.67

7) Find ESS

ESS = TSS – CSS = 80.67 – 10 = 70.67

8) Degree of freedom

For CSS = c − 1 = 4 − 1 = 3

For ESS = c(r − 1) = 4(3 − 1) = 8

9) One Way ANOVA table

Source    Sum of squares    d.f.    MSS                F ratio
CSS       10                3       10/3 = 3.33        F = 8.83/3.33 = 2.65
ESS       70.67             8       70.67/8 = 8.83

10) Given; at 5% level of significance; $Tab\ F_{8,3} = 4.07$

$Cal\ F_{8,3} = 2.65$

$Cal\ F \le Tab\ F$

So, accept $H_0$

Thus, the varieties do not differ significantly among themselves.
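The one-way steps (grand total, correction factor, CSS, TSS, ESS and mean sums of squares) can be sketched in plain Python. The function name `one_way_anova` and the dictionary keys are illustrative. The sketch works on the original, uncoded data, since subtracting a constant from every observation does not change the sums of squares. It returns the two mean sums of squares and leaves the choice of ratio to the reader.

```python
def one_way_anova(columns):
    """columns: list of lists, one list of observations per group."""
    n_total = sum(len(col) for col in columns)
    grand_total = sum(sum(col) for col in columns)
    cf = grand_total**2 / n_total                       # correction factor
    css = sum(sum(col)**2 / len(col) for col in columns) - cf
    tss = sum(x**2 for col in columns for x in col) - cf
    ess = tss - css
    df_css = len(columns) - 1
    df_ess = n_total - len(columns)                     # c(r - 1) when groups are equal in size
    return {"CF": cf, "CSS": css, "TSS": tss, "ESS": ess,
            "df": (df_css, df_ess),
            "MSS": (css / df_css, ess / df_ess)}

# Ex.1 (speeds of machines A1, A2, A3)
machines = [[25, 30, 36, 38, 31], [31, 39, 38, 42, 35], [24, 30, 28, 25, 38]]
result = one_way_anova(machines)
ms_columns, ms_error = result["MSS"]
f_ratio = ms_columns / ms_error
print(round(result["CSS"], 2), round(result["ESS"], 2), result["df"], round(f_ratio, 3))
```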

b) Two-way ANOVA:

Ex.3. The following are the number of defectives produced by four workmen operating in turn

three different machines. Perform two-way ANOVA and check the difference between workmen and

machines.

(Given; At 5% F6,2 = 5.14 and F3,6 = 1.94)

Machines Workmen

X1 X2 X3 X4

Y1 36 39 42 40

Y2 32 41 39 32

Y3 37 36 44 26

Ans.:

1) Set up null Hypothesis.

a) for columns (workmen)


𝐻0 = There is no significant difference between workmen.

b) for rows (machines)

𝐻0 = There is no significant difference between machines.

2) Minimize data by subtracting the smallest observation (i.e. 26) from all and calculate

column sums & row sums

Data become;

          X1         X2         X3         X4         Row totals
Y1        10         13         16         14         r1 = 53
Y2        6          15         13         6          r2 = 40
Y3        11         10         18         0          r3 = 39
Total     c1 = 27    c2 = 38    c3 = 47    c4 = 20

3) Calculate Grand Total

$G.T. = c_1 + c_2 + c_3 + c_4 = r_1 + r_2 + r_3$

= 27 + 38 + 47 + 20 = 53 + 40 + 39

= 132

4) Find Correction Factor:

$$C.F. = \frac{(G.T.)^2}{N} = \frac{(132)^2}{12} = 1452$$

5) a) Calculate CSS

$$CSS = \frac{c_1^2}{n_1} + \frac{c_2^2}{n_2} + \frac{c_3^2}{n_3} + \frac{c_4^2}{n_4} - C.F.$$

$$= \frac{27^2}{3} + \frac{38^2}{3} + \frac{47^2}{3} + \frac{20^2}{3} - 1452$$

= [243 + 481.33 + 736.33 + 133.33] – 1452

= 1594 – 1452

= 142

b) Calculate RSS

$$RSS = \frac{r_1^2}{n_1} + \frac{r_2^2}{n_2} + \frac{r_3^2}{n_3} - C.F.$$

$$= \frac{53^2}{4} + \frac{40^2}{4} + \frac{39^2}{4} - 1452$$

= [702.25 + 400 + 380.25] – 1452

= 1482.5 – 1452

= 30.5

6) Calculate TSS

$TSS = [\text{Sum of squares of all observations}] - C.F.$

Squares of the coded observations:

X1         X2         X3         X4
100        169        256        196
36         225        169        36
121        100        324        0
Σ = 257    Σ = 494    Σ = 749    Σ = 232

TSS = [257 + 494 + 749 + 232] – 1452

= 1732 – 1452

= 280

7) Find ESS

ESS = TSS – (CSS + RSS) = 280 – (142 + 30.5) = 107.5

8) Degree of freedom

For CSS = c − 1 = 4 − 1 = 3

For RSS = r − 1 = 3 − 1 = 2

For ESS = (c − 1)(r − 1) = (4 − 1)(3 − 1) = 6

9) Two Way ANOVA table

Source    Sum of squares    d.f.    MSS                  F ratio
CSS       142               3       142/3 = 47.33        F3,6 = 47.33/17.92 = 2.64
RSS       30.5              2       30.5/2 = 15.25       F6,2 = 17.92/15.25 = 1.175
ESS       107.5             6       107.5/6 = 17.92

10) a) For columns (workmen):

Given at 5% level of significance; $Tab\ F_{3,6} = 1.94$

$Cal\ F_{3,6} = 2.64$

$Cal\ F > Tab\ F$

So, reject $H_0$

Thus, there is a significant difference between workmen.

b) For rows (machines):

Given at 5% level of significance; $Tab\ F_{6,2} = 5.14$

$Cal\ F_{6,2} = 1.175$

$Cal\ F \le Tab\ F$

So, accept $H_0$

Thus, there is no significant difference between machines.

ANOVA vs. T Test

A Student's t-test tells whether there is a significant difference between the means of two groups. It compares the means directly, while ANOVA compares the variation between groups with the variation within groups in order to draw a conclusion about the means of three or more populations. ANOVA gives a single number (the F-statistic) and one p-value to help to accept or reject the null hypothesis.
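The two-way calculation of Ex.3 (one observation per cell) can be sketched in plain Python as below. The function name `two_way_anova` and the dictionary keys are illustrative. The original, uncoded data are used, since subtracting 26 from every observation leaves CSS, RSS, TSS and ESS unchanged.

```python
def two_way_anova(table):
    """table: list of rows, one observation per cell (rows x columns)."""
    r, c = len(table), len(table[0])
    grand_total = sum(sum(row) for row in table)
    cf = grand_total**2 / (r * c)                       # correction factor
    col_sums = [sum(row[j] for row in table) for j in range(c)]
    row_sums = [sum(row) for row in table]
    css = sum(s**2 for s in col_sums) / r - cf          # each column holds r observations
    rss = sum(s**2 for s in row_sums) / c - cf          # each row holds c observations
    tss = sum(x**2 for row in table for x in row) - cf
    ess = tss - (css + rss)
    return {"CF": cf, "CSS": css, "RSS": rss, "TSS": tss, "ESS": ess,
            "df": (c - 1, r - 1, (c - 1) * (r - 1))}

# Ex.3: rows are machines Y1..Y3, columns are workmen X1..X4
defectives = [[36, 39, 42, 40],
              [32, 41, 39, 32],
              [37, 36, 44, 26]]
result = two_way_anova(defectives)
df_c, df_r, df_e = result["df"]
ms_columns = result["CSS"] / df_c
ms_rows = result["RSS"] / df_r
ms_error = result["ESS"] / df_e
print(result["CSS"], result["RSS"], result["ESS"], result["df"])
```

The three mean sums of squares are returned separately so that either ratio used in the ANOVA table can be formed from them.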


Exercise

1. What do you mean by ANOVA? Explain in short.

2. Describe various steps for ANOVA.

3. The following table gives the yield of 15 sample plots under three varieties of seed. You are required to find whether the average yields of land under the different varieties of seed show a significant difference.

A 20 21 23 16 20

B 18 20 17 15 25

C 25 28 22 28 32

4. Blood group A, B, and AB of four different persons were studied for a particular

characteristic. Set a table of analysis of variance and find out whether there is existence of

any significant difference between the mean of persons blood with their blood group

varieties.

Persons Blood group

A B AB

1 7 9 10

2 4 7 6

3 7 5 7

4 6 6 9

5. The following data gives sales made by three MR. Perform analysis of variance to test

whether there is any difference between sales made by three MR.

A B C

300 600 700

400 300 300

300 300 400

500 400 600

0 - 500

6. The life time in hours for cells taken randomly from four individuals were observed as given

in table. On the basis of observation, analyze the values for their variance and study its

significance.


Persons Cell life time (hrs)

1 2 3 4

P 61 68 60 58

Q 68 61 57 50

R 70 55 64 59

S 60 59 62 55

7. Samples of peanut butter produced by a company in three different batches are tested for

autotoxin content (p.p.h) and obtain following results. Use the 0.05 level of significance to

test whether the difference among the three samples means are significant.

Batch A 1.0 2.2 4.8 0.4 1.5 3.3

Batch B 0.7 1.2 5.2 3.6 1.8 2.5

Batch C 4.3 5.5 2.7 1.1 0.3 0.5

8. Two random samples were drawn from normal populations. Set up ANOVA table for given

data.

Sample 1 20 15 14 16 18 10 12 17

Sample 2 16 15 8 28 10 14 10 8 12 16

9. Following table contain % drug release of 15 samples containing different concentrations of

Excipients. Check whether the three varieties have any impact on different samples.

A 85 73 66 75 67

B 70 83 77 40 65

C 90 70 64 30 55

10. A company appoints four salesmen J, K, L, and M and observes their sales in three different

zones viz. A, B, C. Carry out ANOVA. (Note: figures are in lakhs)

Zones Salesmen

J K L M

A 55 41 48 56

B 52 51 56 49

C 49 49 46 48


9. Chi-square test

Introduction:

In previous few chapters, the statistical inference has concentrated on the statistics such as

mean and proportion which have been used to obtain interval estimates and test hypotheses

considering population parameters. This chapter changes the approach to inferential statistics by

studying the whole distributions and relationship between two distributions. These inferences are

drawn using chi-square test.

The chi-square test is the most useful and widely used test in statistics for the assumptions

are minimal to perform the test. Thus, this test can be used in most circumstances.

Chi-square test:

The chi-square test is a procedure for testing whether two categorical variables are related in some population. The null hypothesis of the chi-square test is that no relationship exists between the categorical variables in the population; they are independent. This test is based on the difference between what is actually observed in the data and what would be expected if there was truly no relationship between the variables.

The chi-square statistic is denoted by $\chi^2$ and given by,

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

Where, $O_i$ = Observed values

$E_i$ = Expected values

Steps to perform chi-square test:

1) Set up the hypothesis as,

To test:

𝐻0: 𝑂𝑖 = 𝐸𝑖

Against:

𝐻1: 𝑂𝑖 ≠ 𝐸𝑖

2) Test statistics: Chi square

3) Formula:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

4) Inference:

At 𝛼% level,

If 𝐶𝑎𝑙 𝜒2 ≤ 𝑇𝑎𝑏 𝜒2 , then Accept 𝐻0 .

Otherwise, reject 𝐻0 .


Types of chi-square test:

There are two types of chi-square test.

1. Chi-square test of goodness of fit

2. Chi-square test for Independence

1. Chi-square test of goodness of fit:

It determines if the sample data match with certain population i.e. it tells if the sample data

represents the data we would expect to find in actual population. It can be used for the discrete

distributions like the Binomial distribution and Poisson distribution. The chi-square Goodness of fit

is to fit one categorical variable to a distribution.

2. Chi-square test for Independence:

It determines whether the two events are independent i.e. it allows researcher to determine

whether the variables are independent of each other or whether there is a pattern of dependence

between them. The chi-square test for independence compares two sets of data to see if there is a

relationship.

The hypothesis for the chi-square test of independence is stated as:

To test:

H0: The two categorical variables in the population are independent.

Against:

H1: The two categorical variables in the population are dependent.

Ex.1. During clinical trials of newly developed drug for treatment of diabetic retinopathy out of 120

volunteers, 76 persons were administered a new drug. Out of 76 persons, 24 persons showed

symptoms of DR. Amongst those not administered the new drug 12 persons were not affected by

DR. Find out whether the new drug is effective or not by using Chi-Square test.

(Given, at 5% level, 𝜒2 = 3.84)

Ans.: From given data, prepare table as given below;

DR status        Drug administered    Not administered    Total
Developed        24                   32                  56
Not developed    52                   12                  64
Total            76                   44                  120

1) To test:

𝐻0: 𝑂𝑖 = 𝐸𝑖 i.e. Drug is not effective

Against:

𝐻1: 𝑂𝑖 ≠ 𝐸𝑖


2) Test statistics: Chi square

3) Formula:

$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Where,

$$E_{ij} = \frac{c_i \times r_j}{total}$$

Now, calculate expected frequency as below;

for $O_1 = 24$, $E_1 = \frac{76 \times 56}{120} = 35.47$

for $O_2 = 32$, $E_2 = \frac{44 \times 56}{120} = 20.53$

for $O_3 = 52$, $E_3 = \frac{76 \times 64}{120} = 40.53$

for $O_4 = 12$, $E_4 = \frac{44 \times 64}{120} = 23.47$

Now, prepare following table;

Observed frequency (O)    Expected frequency (E)    O − E       (O − E)²    (O − E)²/E
24                        35.47                     −11.47      131.5609    3.709
32                        20.53                     11.47       131.5609    6.408
52                        40.53                     11.47       131.5609    3.246
12                        23.47                     −11.47      131.5609    5.605
TOTAL                                                                       18.968

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 18.968$$

4) Inference:

At 5% level and 1 d.f., $Tab\ \chi^2 = 3.84$

$Cal\ \chi^2 > Tab\ \chi^2$,

So, reject $H_0$.

Thus, drug is effective.
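The expected frequencies and the chi-square statistic for a contingency table can be sketched in plain Python. The function name `chi_square` is illustrative. The code keeps the expected frequencies unrounded, so its result may differ slightly in the second decimal place from a hand calculation that rounds each E to two decimals.

```python
def chi_square(observed):
    """observed: contingency table as a list of rows.
    Returns (chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total    # expected frequency
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Ex.1: rows are DR developed / not developed, columns are drug / no drug
dr_table = [[24, 32],
            [52, 12]]
chi2, df = chi_square(dr_table)
reject_h0 = chi2 > 3.84        # table value for 1 d.f. at the 5% level
print(round(chi2, 2), df, reject_h0)
```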

Ex.2. The following table shows the result of an experiment studying the effectiveness of vaccination on resistance to a particular disease.

                  Attacked    Not attacked
Vaccinated        11          31
Non-Vaccinated    30          8

Using Chi-square test, analyze the results of the experiment for independence between vaccination and attack.

(Given, at 5% level, $\chi^2 = 3.84$)

Ans.: From given data, prepare table as given below;

                  Attacked    Not attacked    Total
Vaccinated        11          31              42
Non-Vaccinated    30          8               38
Total             41          39              80

1) To test:

$H_0: O_i = E_i$ i.e. Vaccination and attack of disease are independent.

Against:

$H_1: O_i \neq E_i$

2) Test statistics: Chi square

Formula:

$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Where,

$$E_{ij} = \frac{c_i \times r_j}{total}$$

Now, calculate expected frequency as below;

for $O_1 = 11$, $E_1 = \frac{41 \times 42}{80} = 21.525$

for $O_2 = 31$, $E_2 = \frac{39 \times 42}{80} = 20.475$

for $O_3 = 30$, $E_3 = \frac{41 \times 38}{80} = 19.475$

for $O_4 = 8$, $E_4 = \frac{39 \times 38}{80} = 18.525$

Now, prepare following table;

(O)      (E)        O − E       (O − E)²    (O − E)²/E
11       21.525     −10.525     110.7756    5.1464
31       20.475     10.525      110.7756    5.4103
30       19.475     10.525      110.7756    5.6881
8        18.525     −10.525     110.7756    5.9798
TOTAL                                       22.2246

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 22.2246$$

3) Inference:

At 5% level and 1 d.f., $Tab\ \chi^2 = 3.84$

$Cal\ \chi^2 > Tab\ \chi^2$,

So, reject $H_0$.

Thus, vaccination and attack of disease are not independent.

Ex.3. A certain drug is claimed to be effective in curing colds. In an experiment on 328 people with colds, half of them were given sugar pills and half were given the drug. The patients' reactions to the treatment were recorded as shown below. Test the hypothesis that the drug is no better than the sugar pills.

              Helped   Harmed   No effect
Drug          104      20       40
Sugar pills   88       24       52

(Given, at 5% level, $\chi^2 = 3.84$)

Ans.: From the given data, prepare the table as given below;

              Helped   Harmed   No effect   Total
Drug          104      20       40          164
Sugar pills   88       24       52          164
Total         192      44       92          328

1) To test:

$H_0: O_i = E_i$ i.e. the drug is no better than the sugar pills.

Against:

$H_1: O_i \neq E_i$

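2) Test statistic: Chi-square

Formula:

$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Where,

$$E_{ij} = \frac{c_i \times r_j}{\text{total}}$$

Now, calculate the expected frequencies as below. Both row totals are 164, so the expected frequencies are the same for the drug and the sugar pills;

For $O = 104$ and $O = 88$ (Helped): $E = \frac{192 \times 164}{328} = 96$

For $O = 20$ and $O = 24$ (Harmed): $E = \frac{44 \times 164}{328} = 22$

For $O = 40$ and $O = 52$ (No effect): $E = \frac{92 \times 164}{328} = 46$

Now, prepare the following table;

(O)     (E)    O - E   (O - E)^2   (O - E)^2 / E
104     96     8       64          0.6667
20      22     -2      4           0.1818
40      46     -6      36          0.7826
88      96     -8      64          0.6667
24      22     2       4           0.1818
52      46     6       36          0.7826
TOTAL                              3.2622

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 3.2622$$

3) Inference:

A 2 × 3 table has (2 − 1)(3 − 1) = 2 d.f., for which Tab $\chi^2 = 5.99$ at the 5% level (the value 3.84 quoted in the question applies to 1 d.f.).

Cal $\chi^2$ < Tab $\chi^2$ for either tabulated value,

So, accept $H_0$.

Thus, the drug is no better than the sugar pills.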

Ex.4. In a mobile phone manufacturing company, a different number of faulty mobiles was produced in each shift, as shown in the table. Test the hypothesis that the number of faulty mobiles produced is independent of the shift, if the numbers of shifts worked are in the ratio 4:5:3.

(Given, at 5% level and 2 d.f., $\chi^2 = 5.99$)

Shift       No. of faulty mobiles
Morning     10
Afternoon   24
Night       08

Ans.:

1) To test:

$H_0: O_i = E_i$ i.e. the number of faulty mobiles is independent of the shift

Against:

$H_1: O_i \neq E_i$

2) Test statistic: Chi-square

Formula:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

The total number of faulty pieces is 42, which is expected to be distributed in the ratio 4:5:3 (the total for the ratio is 4 + 5 + 3 = 12).

So,

For $O_1 = 10$: $E_1 = \frac{4}{12} \times 42 = 14$

For $O_2 = 24$: $E_2 = \frac{5}{12} \times 42 = 17.5$

For $O_3 = 8$: $E_3 = \frac{3}{12} \times 42 = 10.5$

Now, prepare the following table;

Shift       (O)   (E)    O - E   (O - E)^2   (O - E)^2 / E
Morning     10    14     -4      16          1.1429
Afternoon   24    17.5   6.5     42.25       2.4143
Night       08    10.5   -2.5    6.25        0.5952
TOTAL                                        4.1524

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 4.1524$$

3) Inference:

At 5% level and 2 d.f., Tab $\chi^2 = 5.99$

Cal $\chi^2$ < Tab $\chi^2$,

So, accept $H_0$.

Thus, the number of faulty mobiles is independent of the shift.
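When the expected frequencies come from a stated ratio, as in Ex.4, the calculation can be sketched as follows. The function name `chi_square_goodness_of_fit` is an assumption made for this illustration; the data and the tabulated value 5.99 are those given in Ex.4, and the result is about 4.15.

```python
def chi_square_goodness_of_fit(observed, ratio):
    """Return (chi2, dof, expected) when the expected counts follow `ratio`."""
    total = sum(observed)
    ratio_sum = sum(ratio)

    # Expected frequency of each class = its share of the ratio * total count
    expected = [total * part / ratio_sum for part in ratio]

    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = len(observed) - 1
    return chi2, dof, expected


# Ex.4 data: faulty mobiles in the morning, afternoon and night shifts
observed = [10, 24, 8]
chi2, dof, expected = chi_square_goodness_of_fit(observed, ratio=[4, 5, 3])

critical_value = 5.99  # tabulated value at 5% level and 2 d.f. (given in Ex.4)
print(f"Expected frequencies: {expected}")
print(f"Cal chi-square = {chi2:.4f} with {dof} d.f.")
print("Reject H0" if chi2 > critical_value else "Accept H0")
```

Passing an equal ratio such as `[1, 1, 1]` gives the case in which every class is expected to have the average frequency.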

Ex.5. In an industry, the numbers of accidents in three shifts were 12, 14 and 19. Can we conclude that all shifts are equally dangerous?

(Given at 5% level, $\chi^2_2 = 5.99$, $\chi^2_3 = 7.81$)

Ans.:

1) To test:

𝐻0: 𝑂𝑖 = 𝐸𝑖 i.e. All shifts are equally dangerous.

Against:

𝐻1: 𝑂𝑖 ≠ 𝐸𝑖

2) Test statistic: Chi-square

Formula:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

Under the null hypothesis, the expected number of accidents in each shift is equal to the average.

$$\text{Average} = \frac{12 + 14 + 19}{3} = 15$$

So, $E_1 = E_2 = E_3 = 15$.

Now, prepare a table as follows;

Shift       (O)   (E)   O - E   (O - E)^2   (O - E)^2 / E
Morning     12    15    -3      9           0.6000
Afternoon   14    15    -1      1           0.0667
Night       19    15    4       16          1.0667
TOTAL                                       1.7333

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 1.7333$$

3) Inference:

At 5% level and 2 d.f., Tab $\chi^2_2 = 5.99$

Cal $\chi^2$ < Tab $\chi^2$,

So, accept $H_0$.

Thus, all shifts are equally dangerous.

Exercise

1. A newly discovered drug was administered to 800 volunteers out of a total of 3000 volunteers during clinical trials. The numbers of fever cases are shown below. Discuss the usefulness of the drug (use the $\chi^2$ test at 1% level of significance).

Treatment   Fever   No fever   Total
Drug        20      780        800
Placebo     200     2000       2200
Total       220     2780       3000

2. A certain drug is claimed to be effective in the treatment of migraine. During clinical trials, out of 400 people suffering from migraine, 200 were administered a tablet containing the drug and 200 were administered a placebo tablet. The patient responses were recorded as shown in the table. Test the hypothesis that the drug is no better than the placebo.

Treatment   Cured   Harmed   No effect
Drug        135     20       45
Placebo     65      25       110

3. The following table shows the number of people with high blood pressure in different age groups. Do the data reveal an association between age group and high blood pressure?

Age group   No. of people   High blood pressure cases
20-30       100             10
30-40       350             200
40-50       350             265
50-60       200             170

4. Genetic theory states that children having one parent of blood group A and the other of blood group B will always be of one of the blood groups A, AB or B, and that the proportions of the three types will on average be 1:2:1. A report states that out of 250 children having one A parent and one B parent, 30% were found to be of type A, 45% of type AB and the remainder of type B. Test the hypothesis by the $\chi^2$ test.

5. Theory predicts that the proportions of mangoes in the four groups J, K, L and M should be 9:3:3:1. In an experiment among 1600 mangoes, the numbers in the four groups were 880, 315, 280 and 125. Do the experimental results support the theory?

6. From the data given below about the treatment of 500 patients suffering from a disease, state

whether the new treatment is superior to the conventional treatment.

Treatment Favorable Non-favorable Total

New 280 60 340

Conventional 120 40 160

Total 400 100 500

(Given for 1 d.f., $\chi^2_{0.05} = 3.84$)

7. In Mendel's experiment on pea breeding, the following frequencies of seeds were obtained: 315 round and yellow, 101 wrinkled and yellow, 108 round and green, 32 wrinkled and green. Theory predicts that the frequencies should be in the proportion 9:3:3:1. Examine the correspondence between theory and experiment. (Given at 5% level, $\chi^2_3 = 7.815$)

8. From the following data use Chi-square test and conclude whether inoculation is effective in

preventing a disease.

Attacked Not attacked Total

Inoculated 31 469 500

Non-inoculated 185 1315 1500

Total 216 1784 2000

(Given for 1 d.f., $\chi^2_{0.05} = 3.84$)

9. In a mobile phone manufacturing company, a different number of faulty mobiles was produced in each shift, as shown in the table. Test the hypothesis that the number of faulty mobiles produced is independent of the shift, if the numbers of shifts worked are in the ratio 4:5:3.

(Given, at 5% level and 2 d.f., $\chi^2 = 5.99$)

Shift No. of faulty mobiles

Morning 10

Afternoon 24

Night 08

10. A certain drug is claimed to be effective in curing colds. In an experiment on 328 people with colds, half of them were given sugar pills and half were given the drug. The patients' reactions to the treatment were recorded. Test the hypothesis that the drug is no better than the sugar pills. (Given, at 5% level, $\chi^2 = 3.84$)

Helped Harmed No effect

Drug 104 20 40

Sugar pills 88 24 52

10. Non-parametric test

Introduction:

In the previous chapters, the methods of hypothesis testing, namely the z-test, Student's t-test, Karl Pearson's correlation coefficient and ANOVA, were based on underlying assumptions about the data, such as "the data are normally distributed" or "the population variances are equal". These methods are referred to as parametric tests. But what if the assumptions on which these methods are based do not hold? In such cases we need distribution-free methods of hypothesis testing, which make fewer assumptions about the data. Such distribution-free methods are called nonparametric tests.

Nonparametric tests are often used when the population has an unknown distribution or when the sample size is small. Nonparametric statistics are typically used on qualitative data. They are useful when the data have no clear numerical interpretation, and are best suited to data that can be ranked. These methods can be used without the mean, sample size, standard deviation or estimates of any other related parameters when none of that information is available.

Common nonparametric tests include the Chi-square test, the Mann Whitney U-test (Wilcoxon rank-sum test), the Kruskal-Wallis test, the Friedman test and Spearman's rank-order correlation.

1. Mann Whitney U-test (Wilcoxon rank-sum test):

A popular nonparametric test to compare outcomes between two independent groups is the Mann

Whitney U test. The Mann Whitney U test, sometimes called the Mann Whitney Wilcoxon Test or the

Wilcoxon Rank Sum Test, is used to test whether two samples are likely to derive from the same

population. This test also compares the medians between the two populations. This test is often

performed as a two-sided test.

Steps:

1) Set up null hypothesis as follows:

To test:

H0: The two populations are equal

Against:

H1: The two populations are not equal.

2) Test statistic:

If $n_1$ and $n_2$ are the two sample sizes, then the Mann Whitney U test statistic, denoted by U, is the smaller of $U_1$ and $U_2$ defined below.

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$

Where, $R_1 = R_x$ i.e. sum of ranks of data x.

$R_2 = R_y$ i.e. sum of ranks of data y.

3) Compute the test statistic:

Find the ranks of the given data and prepare the table as follows.

Note: the ranks are assigned from the lowest observation to the highest observation as 1 to $n_1 + n_2$. For repeated observations, the average of their ranks is assigned to each.

$x$   $y$   $R_x$   $R_y$

4) Set up decision rule and make conclusion:

For given 𝛼 level of significance,

If 𝐶𝑎𝑙 𝑈 ≤ 𝑇𝑎𝑏 𝑈 then Reject 𝐻0

Otherwise Accept 𝐻0 .

Ex.1. Consider a Phase II clinical trial designed to investigate the effectiveness of a new drug to

reduce symptoms of asthma in children. A total of n=10 participants are randomized to receive

either the new drug or a placebo. Participants are asked to record the number of episodes of

shortness of breath over a 1 week period following receipt of the assigned treatment. The data are

shown below.

Placebo 7 5 6 4 12

New Drug 3 6 4 2 1

Is there a difference in the number of episodes of shortness of breath over a 1 week period

in participants receiving the new drug as compared to those receiving the placebo? By inspection, it

appears that participants receiving the placebo have more episodes of shortness of breath, but is this statistically significant? (Given at the 5% level of significance and for $n_1 = n_2 = 5$, Tab U = 2.)

Ans:

1) To test:

H0: The two populations are equal

Against:

H1: The two populations are not equal.

2) Test statistic:

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$

3) Prepare the table as follows.

Original data          Ascending order              Ranks
Placebo   New Drug     Placebo (x)   New Drug (y)   Placebo (R_x)   New Drug (R_y)
7         3            -             1              -               1
5         6            -             2              -               2
6         4            -             3              -               3
4         2            4             4              4.5             4.5
12        1            5             -              6               -
-         -            6             6              7.5             7.5
-         -            7             -              9               -
-         -            12            -              10              -
Total                                               37              18

$$U_1 = 5 \times 5 + \frac{5(5 + 1)}{2} - 37 = 3$$

$$U_2 = 5 \times 5 + \frac{5(5 + 1)}{2} - 18 = 22$$

Here, $U_1 < U_2$. So $U = U_1 = 3$

4) Given at 5% level of significance and $n_1 = n_2 = 5$, Tab U = 2

Cal U > Tab U

So accept $H_0$.

Thus, there is no statistically significant difference in the number of episodes of shortness of breath over a 1 week period in participants receiving the new drug as compared to those receiving the placebo.
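The ranking and the two U values computed above can be checked with a short Python sketch. The helper names `average_ranks` and `mann_whitney_u` are assumptions made for this illustration; tied observations receive the average of their ranks, as described in the note to step 3.

```python
def average_ranks(values):
    """Rank values from lowest to highest; tied values get their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, U1, U2, R1, R2) using the formulas of step 2."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))  # rank the pooled sample
    r1 = sum(ranks[:n1])
    r2 = sum(ranks[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return min(u1, u2), u1, u2, r1, r2


placebo = [7, 5, 6, 4, 12]
new_drug = [3, 6, 4, 2, 1]
u, u1, u2, r1, r2 = mann_whitney_u(placebo, new_drug)

tab_u = 2  # tabulated value at 5% level for n1 = n2 = 5 (given in Ex.1)
print(f"R1 = {r1}, R2 = {r2}, U1 = {u1}, U2 = {u2}, U = {u}")
print("Reject H0" if u <= tab_u else "Accept H0")
```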

Ex.2. A new approach to prenatal care is proposed for pregnant women living in a rural community. The new program involves in-home visits during the course of pregnancy in addition to the usual or regularly scheduled visits. A pilot randomized trial with 15 pregnant women is designed to evaluate whether women who participate in the program deliver healthier babies than women receiving usual care. The outcome is the APGAR score measured 5 minutes after birth. Recall that APGAR scores range from 0 to 10, with scores of 7 or higher considered normal (healthy), 4-6 low and 0-3 critically low. The data are shown below.

Usual Care    8   7   6   2   5    8   7   3
New Program   9   8   7   8   10   9   6

Is there statistical evidence of a difference in APGAR scores in women receiving the new and enhanced versus usual prenatal care? (Given at 5% level, for $n_1 = 8$, $n_2 = 7$, Tab U = 10)

Ans:

1) To test:

H0: The two populations are equal

Against:

H1: The two populations are not equal.

2) Test statistic:

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$

3) Prepare the table as follows.

Original data              Ascending order                  Ranks
Usual Care   New Program   Usual Care (x)   New Program (y)   Usual Care (R_x)   New Program (R_y)
8            9             2                -                 1                  -
7            8             3                -                 2                  -
6            7             5                -                 3                  -
2            8             6                6                 4.5                4.5
5            10            7                7                 7                  7
8            9             7                -                 7                  -
7            6             8                8                 10.5               10.5
3            -             8                8                 10.5               10.5
-            -             -                9                 -                  13.5
-            -             -                9                 -                  13.5
-            -             -                10                -                  15
Total                                                         45.5               74.5

$$U_1 = 8 \times 7 + \frac{8(8 + 1)}{2} - 45.5 = 46.5$$

$$U_2 = 8 \times 7 + \frac{7(7 + 1)}{2} - 74.5 = 9.5$$

Here, $U_2 < U_1$. So $U = U_2 = 9.5$

4) Given at 5% level of significance and $n_1 = 8$, $n_2 = 7$, Tab U = 10

Cal U < Tab U

So reject $H_0$.

Thus, the two populations of APGAR scores are not equal in women receiving usual prenatal care as compared to the new program of prenatal care.

Ex.3. A clinical trial is run to assess the effectiveness of a new anti-retroviral therapy for patients

with HIV. Patients are randomized to receive a standard anti-retroviral therapy (usual care) or the

new anti-retroviral therapy and are monitored for 3 months. The primary outcome is viral load

which represents the number of HIV copies per milliliter of blood. A total of 30 participants are

randomized and the data are shown below.

Std Therapy   7500   8000   2000   550    1250   1000   2250   6800   3400   6300   9100   970    1040   670    400
New Therapy   400    250    800    1400   8000   7400   1020   6000   920    1420   2700   4200   5200   4100   undetectable

Is there statistical evidence of a difference in viral load in patients receiving the standard versus the new anti-retroviral therapy? (Given at 5% level and for $n_1 = n_2 = 15$, Tab U = 64)

Ans:

1) To test:

H0: The two populations are equal

Against:

H1: The two populations are not equal.

2) Test statistic:

Because viral load measures are not normally distributed (with outliers as well as limits of detection (e.g., "undetectable")), we use the Mann-Whitney U test.

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$

3) Prepare the table as follows.

Original data           Ascending order          Ranks
x       y               x       y                R_x     R_y
7500    400             -       undetectable     -       1
8000    250             -       250              -       2
2000    800             400     400              3.5     3.5
550     1400            550     -                5       -
1250    8000            670     -                6       -
1000    7400            -       800              -       7
2250    1020            -       920              -       8
6800    6000            970     -                9       -
3400    920             1000    -                10      -
6300    1420            -       1020             -       11
9100    2700            1040    -                12      -
970     4200            1250    -                13      -
1040    5200            -       1400             -       14
670     4100            -       1420             -       15
400     undetectable    2000    -                16      -
-       -               2250    -                17      -
-       -               -       2700             -       18
-       -               3400    -                19      -
-       -               -       4100             -       20
-       -               -       4200             -       21
-       -               -       5200             -       22
-       -               -       6000             -       23
-       -               6300    -                24      -
-       -               6800    -                25      -
-       -               -       7400             -       26
-       -               7500    -                27      -
-       -               8000    8000             28.5    28.5
-       -               9100    -                30      -
Total                                            245     220

$$U_1 = 15 \times 15 + \frac{15(15 + 1)}{2} - 245 = 100$$

$$U_2 = 15 \times 15 + \frac{15(15 + 1)}{2} - 220 = 125$$

Here, $U_1 < U_2$. So $U = U_1 = 100$

4) Given at 5% level of significance and $n_1 = n_2 = 15$, Tab U = 64

Cal U > Tab U

So accept $H_0$.

Thus, there is no statistical evidence that the treatment groups differ in viral load.

Tests for paired data: The following nonparametric tests are used in the case of paired data, such as "before-after" studies.

Sign Test: The Sign Test is the simplest nonparametric test for matched or paired data. The approach is to analyze only the signs of the difference scores. The test statistic for the Sign Test is the number of positive signs or the number of negative signs, whichever is smaller.

Steps:

1) Set up null hypothesis.

To test:

H0: The median difference is zero.

Against:

H1: The median difference is not zero

2) Test statistic:

The test statistic for the Sign Test is the smaller of the number of positive and the number of negative signs.

3) Compute the test statistic:

Calculate the difference between the before and after values for each pair, always subtracting in the same direction. Count the number of positive differences and the number of negative differences; the smaller count is the test statistic.

4) Set up decision rule and make conclusion:

For given $\alpha$ level of significance,

If the smaller of the number of positive and negative signs ≤ Tab value then Reject $H_0$

Otherwise Accept $H_0$.

Ex.1. A new chemotherapy treatment is proposed for patients with breast cancer. Investigators

are concerned with patient's ability to tolerate the treatment and assess their quality of life both

before and after receiving the new chemotherapy treatment. Quality of life (QOL) is measured on

an ordinal scale and for analysis purposes, numbers are assigned to each response category as

follows: 1=Poor, 2= Fair, 3=Good, 4= Very Good, 5 = Excellent.

The data are shown below.

Patient   QOL Before Chemotherapy Treatment   QOL After Chemotherapy Treatment
1         3                                   2
2         2                                   3
3         3                                   4
4         2                                   4
5         1                                   1
6         3                                   4
7         2                                   4
8         3                                   3
9         2                                   1
10        1                                   3
11        3                                   4
12        2                                   3

Test whether there is a difference in QOL after chemotherapy treatment as compared to before. (Given at 5% level and n = 12, Tab value = 2)

Ans:

1) Set up null hypothesis.

To test:

H0: The median difference is zero.

Against:

H1: The median difference is not zero

2) Test statistic:

The test statistic for the Sign Test is the smaller of the number of positive and the number of negative signs.

3) Compute the test statistic:

Calculate the difference After − Before for each patient and note its sign. Here there are two zero differences, so randomly assign one negative sign (i.e., "-" to patient 5) and one positive sign (i.e., "+" to patient 8), as follows:

Patient   QOL Before Treatment   QOL After Treatment   Difference (After - Before)   Sign
1         3                      2                     -1                            -
2         2                      3                     1                             +
3         3                      4                     1                             +
4         2                      4                     2                             +
5         1                      1                     0                             -
6         3                      4                     1                             +
7         2                      4                     2                             +
8         3                      3                     0                             +
9         2                      1                     -1                            -
10        1                      3                     2                             +
11        3                      4                     1                             +
12        2                      3                     1                             +

There are 9 positive signs and 3 negative signs. The test statistic is the smaller of the two, i.e. the number of negative signs, which is equal to 3.

4) Given at 5% level and n = 12, Tab value = 2

The smaller of the number of positive and negative signs > Tab value

So accept $H_0$.

Thus, there is no statistically significant difference in QOL after chemotherapy treatment as compared to before.
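The counting of signs in the example can be sketched in Python as below. The function names are assumptions made for this illustration. Zero differences are shared between the two signs alternately (first to "-", then to "+"), which reproduces the split used above; the text assigns them at random. The exact two-sided binomial p-value is added only as a cross-check and is not part of the table-based procedure described in the text.

```python
from math import comb


def sign_test_statistic(before, after):
    """Return (statistic, positives, negatives) for paired before/after data."""
    diffs = [a - b for b, a in zip(before, after)]  # After - Before
    positives = sum(d > 0 for d in diffs)
    negatives = sum(d < 0 for d in diffs)
    zeros = sum(d == 0 for d in diffs)

    # Assumption of this sketch: zeros are split alternately, "-" first
    negatives += (zeros + 1) // 2
    positives += zeros // 2
    return min(positives, negatives), positives, negatives


def two_sided_binomial_p(statistic, n):
    """Exact two-sided p-value under H0, where each sign has probability 1/2."""
    tail = sum(comb(n, k) for k in range(statistic + 1)) / 2 ** n
    return min(1.0, 2 * tail)


qol_before = [3, 2, 3, 2, 1, 3, 2, 3, 2, 1, 3, 2]
qol_after = [2, 3, 4, 4, 1, 4, 4, 3, 1, 3, 4, 3]

stat, n_pos, n_neg = sign_test_statistic(qol_before, qol_after)
p_value = two_sided_binomial_p(stat, len(qol_before))

tab_value = 2  # tabulated value at 5% level for n = 12 (given in the example)
print(f"positive signs = {n_pos}, negative signs = {n_neg}, statistic = {stat}")
print("Reject H0" if stat <= tab_value else "Accept H0")
print(f"exact two-sided p-value = {p_value:.3f}")
```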

Wilcoxon Signed Rank

Steps:

1) Set up null hypothesis.

To test:

H0: The median difference is zero.

Against:

H1: The median difference is not zero

2) Test statistic:

The test statistic for the Wilcoxon Signed Rank Test is W, defined as the smaller of W+ and W-, which are the sums of the positive and negative ranks, respectively.

3) Compute the test statistic:

Calculate the difference Before − After for each pair.

Then assign ranks from 1 to n to the differences, ignoring their signs, and attach a negative sign to the rank of each negative difference. Assign the mean of the ranks to repeated differences.

4) Set up decision rule and make conclusion:

For given $\alpha$ level of significance,

If Cal W ≤ Tab W then Reject $H_0$

Otherwise Accept $H_0$.

Ex.1. A study is run to evaluate the effectiveness of an exercise program in reducing systolic blood

pressure in patients with pre-hypertension (defined as a systolic blood pressure between 120-139

mmHg or a diastolic blood pressure between 80-89 mmHg). A total of 15 patients with pre-

hypertension enroll in the study, and their systolic blood pressures are measured. Each patient then

participates in an exercise training program where they learn proper techniques and execution of a

series of exercises. Patients are instructed to do the exercise program 3 times per week for 6 weeks.

After 6 weeks, systolic blood pressures are again measured. The data are shown below.

Patient   Systolic Blood Pressure Before Exercise Program   Systolic Blood Pressure After Exercise Program
1         125                                               118
2         132                                               134
3         138                                               130
4         120                                               124
5         125                                               105
6         127                                               130
7         136                                               130
8         139                                               132
9         131                                               123
10        132                                               128
11        135                                               126
12        136                                               140
13        128                                               135
14        127                                               126
15        130                                               132

Is there a difference in systolic blood pressures after participating in the exercise program as compared to before? (Given at 5% level and n = 15, Tab W = 25)

Ans:

1) Set up null hypothesis.

To test:

H0: The median difference is zero.

Against:

H1: The median difference is not zero

2) Test statistic:

The test statistic for the Wilcoxon Signed Rank Test is W, defined as the smaller of W+ and

W- which are the sums of the positive and negative ranks, respectively.

3) Compute the test statistic:

Patient   BP Before Exercise Program   BP After Exercise Program   Difference (Before - After)
1         125                          118                         7
2         132                          134                         -2
3         138                          130                         8
4         120                          124                         -4
5         125                          105                         20
6         127                          130                         -3
7         136                          130                         6
8         139                          132                         7
9         131                          123                         8
10        132                          128                         4
11        135                          126                         9
12        136                          140                         -4
13        128                          135                         -7
14        127                          126                         1
15        130                          132                         -2

Rank table:

Observed Differences   Differences in ascending order (ignoring sign)   Signed Ranks
7                      1                                                1
-2                     -2                                               -2.5
8                      -2                                               -2.5
-4                     -3                                               -4
20                     -4                                               -6
-3                     -4                                               -6
6                      4                                                6
7                      6                                                8
8                      -7                                               -10
4                      7                                                10
9                      7                                                10
-4                     8                                                12.5
-7                     8                                                12.5
1                      9                                                14
-2                     20                                               15

Here, W+ = 89 and W- = 31

W- < W+

Hence, W = W- = 31

4) Given at 5% level and n = 15, Tab W = 25,

Cal W > Tab W

So accept $H_0$.

Thus, there is no significant difference in systolic blood pressures after the exercise program as compared to before.
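The signed ranks and the sums W+ and W- of this example can be reproduced with the following Python sketch. The helper names `average_ranks` and `wilcoxon_signed_rank` are assumptions made for this illustration, and dropping zero differences is likewise an assumption of the sketch (the example contains none).

```python
def average_ranks(values):
    """Rank values from lowest to highest; tied values get their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(before, after):
    """Return (W, W_plus, W_minus) for paired data, using Before - After."""
    diffs = [b - a for b, a in zip(before, after)]
    diffs = [d for d in diffs if d != 0]  # assumption: zero differences dropped

    # Rank the absolute differences, then sum the ranks separately by sign
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus), w_plus, w_minus


bp_before = [125, 132, 138, 120, 125, 127, 136, 139, 131, 132, 135, 136, 128, 127, 130]
bp_after = [118, 134, 130, 124, 105, 130, 130, 132, 123, 128, 126, 140, 135, 126, 132]

w, w_plus, w_minus = wilcoxon_signed_rank(bp_before, bp_after)

tab_w = 25  # tabulated value at 5% level for n = 15 (given in the example)
print(f"W+ = {w_plus}, W- = {w_minus}, W = {w}")
print("Reject H0" if w <= tab_w else "Accept H0")
```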

2. The Kruskal-Wallis Test:

Like ANOVA in the testing of hypotheses, the nonparametric Kruskal-Wallis test is used to compare the medians of more than two populations. It is sometimes described as an ANOVA with the data replaced by their ranks.

Steps:

1) Set up null hypothesis:

To test:

H0: The k population medians are equal

Against:

H1: The k population medians are not all equal

2) Test statistic:

The Kruskal-Wallis test statistic is denoted by H and given by,

$$H = \frac{12}{N(N + 1)} \left( \frac{R_1^2}{n_1} + \frac{R_2^2}{n_2} + \frac{R_3^2}{n_3} + \cdots \right) - 3(N + 1)$$

Where, $k$ = number of columns (groups)

$N$ = total number of observations

$n_1, n_2, n_3, \ldots$ are the numbers of observations in the respective columns

$R_1, R_2, R_3, \ldots$ are the sums of ranks in the respective columns.

3) Compute the test statistic:

Arrange all the given data together in ascending order and assign ranks from 1 to $N$. Assign the average rank to repeated observations. Then add up the ranks in each column and calculate the test statistic.

4) Set up decision rule and make conclusion:

For given $\alpha$ level of significance,

If Cal H ≤ Tab H then Accept $H_0$

Otherwise Reject $H_0$

Ex.1. A personal trainer is interested in comparing the anaerobic thresholds of elite athletes.

Anaerobic threshold is defined as the point at which the muscles cannot get more oxygen to sustain

activity or the upper limit of aerobic exercise. It is a measure also related to maximum heart rate.

The following data are anaerobic thresholds for distance runners, distance cyclists, distance

swimmers and cross-country skiers.

Distance Runners   Distance Cyclists   Distance Swimmers   Cross-Country Skiers
185                190                 166                 201
179                209                 159                 195
192                182                 170                 180
165                178                 183                 187
174                181                 160                 215

Is there a difference in anaerobic thresholds among the different groups of elite athletes? (Given at 5% level and 3 d.f., Tab H = 7.81)

Ans:

1) Set up null hypothesis:

To test:

H0: The four population medians are equal

Against:

H1: The four population medians are not all equal

2) Test statistic:

The Kruskal-Wallis test statistic is denoted by H and given by,

$$H = \frac{12}{N(N + 1)} \left( \frac{R_1^2}{n_1} + \frac{R_2^2}{n_2} + \frac{R_3^2}{n_3} + \cdots \right) - 3(N + 1)$$

3) Compute the test statistic:

Total sample (ascending order)               Ranks
Runners   Cyclists   Swimmers   Skiers       R_1   R_2   R_3   R_4
-         -          159        -            -     -     1     -
-         -          160        -            -     -     2     -
165       -          -          -            3     -     -     -
-         -          166        -            -     -     4     -
-         -          170        -            -     -     5     -
174       -          -          -            6     -     -     -
-         178        -          -            -     7     -     -
179       -          -          -            8     -     -     -
-         -          -          180          -     -     -     9
-         181        -          -            -     10    -     -
-         182        -          -            -     11    -     -
-         -          183        -            -     -     12    -
185       -          -          -            13    -     -     -
-         -          -          187          -     -     -     14
-         190        -          -            -     15    -     -
192       -          -          -            16    -     -     -
-         -          -          195          -     -     -     17
-         -          -          201          -     -     -     18
-         209        -          -            -     19    -     -
-         -          -          215          -     -     -     20
Total                                        46    62    24    78

Here, $N = 20$, $n_1 = n_2 = n_3 = n_4 = 5$

$R_1 = 46$, $R_2 = 62$, $R_3 = 24$, $R_4 = 78$

$$H = \frac{12}{20(20 + 1)} \left( \frac{46^2}{5} + \frac{62^2}{5} + \frac{24^2}{5} + \frac{78^2}{5} \right) - 3(20 + 1)$$

$$= 0.028571 \times 2524 - 3 \times 21$$

$$= 72.114 - 63$$

$$= 9.114$$

4) Given at 5% level and 3 d.f., Tab H = 7.81

Cal H > Tab H

So reject $H_0$

Thus, there is a difference in median anaerobic thresholds among the four different groups of elite athletes.
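The rank sums and the value of H for this example can be checked with the Python sketch below. The helper names are assumptions made for this illustration. The sketch uses the standard form of the statistic, with the final term 3(N + 1) as in the arithmetic of the example (3 × 21 = 63), and applies no correction for ties; without intermediate rounding the result is about 9.11.

```python
def average_ranks(values):
    """Rank values from lowest to highest; tied values get their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return (H, rank_sums) for a list of independent samples."""
    pooled = [value for group in groups for value in group]
    ranks = average_ranks(pooled)  # ranks 1..N over the pooled sample
    n_total = len(pooled)

    # Sum of ranks R_j belonging to each group
    rank_sums, start = [], 0
    for group in groups:
        rank_sums.append(sum(ranks[start:start + len(group)]))
        start += len(group)

    weighted = sum(r ** 2 / len(g) for r, g in zip(rank_sums, groups))
    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    return h, rank_sums


runners = [185, 179, 192, 165, 174]
cyclists = [190, 209, 182, 178, 181]
swimmers = [166, 159, 170, 183, 160]
skiers = [201, 195, 180, 187, 215]

h, rank_sums = kruskal_wallis_h([runners, cyclists, swimmers, skiers])

tab_h = 7.81  # tabulated value at 5% level and 3 d.f. (given in the example)
print(f"rank sums = {rank_sums}, H = {h:.3f}")
print("Reject H0" if h > tab_h else "Accept H0")
```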

Summary:

Mann Whitney U Test:

Use: To compare a continuous outcome in two independent samples.

Null Hypothesis: H0: Two populations are equal

Test Statistic: The test statistic is U, the smaller of

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$

Where, $R_1 = R_x$ i.e. sum of ranks of data x.

$R_2 = R_y$ i.e. sum of ranks of data y.

Decision Rule: Reject H0 if U ≤ critical value from table.

Sign Test

Use: To compare a continuous outcome in two matched or paired samples.

Null Hypothesis: H0: Median difference is zero

Test Statistic: The test statistic is the smaller of the number of positive and the number of negative signs.

Decision Rule: Reject H0 if the smaller of the number of positive and negative signs ≤ critical value from table.

Wilcoxon Signed Rank Test

Use: To compare a continuous outcome in two matched or paired samples.

Null Hypothesis: H0: Median difference is zero

Test Statistic: The test statistic is W, defined as the smaller of W+ and W-, which are the sums of the positive and negative ranks of the difference scores, respectively.

Decision Rule: Reject H0 if W ≤ critical value from table.

Kruskal Wallis Test

Use: To compare a continuous outcome in more than two independent samples.

Null Hypothesis: H0: k population medians are equal

Test Statistic: The test statistic is H,

$$H = \frac{12}{N(N + 1)} \left( \frac{R_1^2}{n_1} + \frac{R_2^2}{n_2} + \frac{R_3^2}{n_3} + \cdots \right) - 3(N + 1)$$

Where, $k$ = number of columns (groups)

$N$ = total number of observations

$n_1, n_2, n_3, \ldots$ are the numbers of observations in the respective columns

$R_1, R_2, R_3, \ldots$ are the sums of ranks in the respective columns.

Decision Rule: Reject H0 if H > critical value

11. Experimental Design

Research and development departments in Pharmacy continuously work on various experiments for the betterment of human health. This involves a huge investment of time as well as money. Therefore, it is very important that all experiments are planned properly so that the final results are of interest.

In general, researchers first formulate a hypothesis and then verify it by collecting data through different kinds of experiments. It is very important to collect data that are relevant to the study, and this comes under the heading of design of experiments.

Experiment: An experiment is a process or study that results in the collection of data.

Experimental Design: It is a way of carefully planning experiments in advance so that the results will be both objective and valid.

It means planning the experiment so that the information collected is relevant to the stated problem.

The terms ‘Experimental Design’ and ‘Design of Experiments’ are used interchangeably.

Design of experiments involves complete planning of each and every step so that appropriate data are generated in the most economical way possible.

Steps of experimental design:

Step 1: Define statement of problem

Step 2: Formulation of Hypothesis

Step 3: Design experimental method

Step 4: Examination of possible outcomes

Step 5: To observe necessary conditions

Step 6: Performance of experiment

Step 7: Collection of data

Step 8: Application of statistical techniques

Step 9: Drawing conclusion from result

Step 10: Evaluation of investigation

Purpose of an experimental design:

To provide maximum amount of information related to problem under investigation.

To save time, money, personnel and experimental material.

To obtain relevant information

Experimental design should satisfy the following points;

It should be simple.

It should minimize or eliminate confounding variables, which can offer alternative explanations for the experimental results. (A confounding variable is an “extra” variable that was not taken into account. Such variables can ruin an experiment and give false results.)

It should allow the different variables involved in the experiment to be correlated.

It should describe how samples are allocated to groups or selected for the experiment.

Design of experiments involves:

The systematic collection of data.

A focus on the design itself, rather than the results.

Planning changes to independent (input) variables and observing their effect on dependent (response) variables.

Ensuring results are valid, easily interpreted, and definitive.

Basic principles of experimental designs:

Replication, Randomization and Local control are the three basic principles of experimental

designs.

1. Replication: It is an essential feature of an experimental design. It involves repeating the experiment a number of times in order to obtain a more reliable result. Replication is required to minimize experimental error.

2. Randomization: It is a critical principle of experimental design which guarantees that statistical tests will have a valid significance level and also ensures that the bias in the result will be minimal. It helps to reduce errors in results.

Randomization tends to produce study groups that are comparable with respect to known as well as unknown factors affecting the results.

It makes the test valid by making it appropriate to analyse the data.

3. Local control: It is also known as error control. Local control refers to the amount of balancing, blocking and grouping of experimental units. Replication together with local control helps to reduce the experimental errors. Local control makes the experimental design more efficient. The main purpose of error control is to reduce the magnitude of the estimate of experimental error.

Experimental Study Design and sample selection:

Various types of test designs that are used in Pharmacy for clinical trials, bioavailability and

bioequivalence studies are discussed below:

1. Completely randomized design:

In a completely randomized design, the treatments are allocated at random among all the experimental subjects.

Method: Label all subjects with numbers having the same number of digits. For example, if there are 20 subjects, number them 01-20. Randomly select non-repeating numbers from these labels for the first treatment and then repeat for the other treatments.
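The method above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `completely_randomized`, the treatment labels "A" and "B" and the fixed seed are assumptions made for the example, and shuffling the labels is used here as an equivalent of drawing non-repeating random numbers.

```python
import random


def completely_randomized(subjects, treatments, seed=None):
    """Randomly allocate subjects to treatments in (near-)equal group sizes."""
    rng = random.Random(seed)      # a seed makes the allocation reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)          # random order, no subject label repeats
    allocation = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        # Deal the shuffled subjects out to the treatments in turn.
        allocation[treatments[i % len(treatments)]].append(subject)
    return allocation


# Hypothetical example: 20 subjects, two treatments.
allocation = completely_randomized(range(1, 21), ["A", "B"], seed=1)
for treatment, group in allocation.items():
    print(treatment, sorted(group))
```

Each subject appears in exactly one group, and with 20 subjects and two treatments each group receives 10 subjects.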

Advantages:
1. It is the simplest design.
2. It can accommodate any number of subjects and treatments.
3. Although the sample size might not be the same for each treatment, the design is simple to analyze.

Limitations:
1. It is best suited to situations involving relatively few treatments.
2. All subjects must be as homogeneous as possible. Any extreme deviation results in error.

2. Randomized block design:

In this design, subjects are first sorted into homogeneous groups called blocks, and treatments are then assigned at random within each block, in the same way as discussed above.

Method: Subjects are classified into blocks depending on their characteristics, and the treatment is randomized within each block. Each block is independent of the others.
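A minimal sketch of this method is given below. The block names (age bands), subject numbers and treatment labels are hypothetical, and the sketch assumes that each block size is a multiple of the number of treatments so that every treatment appears equally often within a block.

```python
import random


def randomized_block(blocks, treatments, seed=None):
    """Assign treatments at random within each block.

    blocks: dict mapping a block name to the list of subjects in that block.
    Returns a dict: block name -> {subject: treatment}.
    """
    rng = random.Random(seed)
    plan = {}
    for name, members in blocks.items():
        if len(members) % len(treatments) != 0:
            raise ValueError(f"block {name!r} cannot be balanced")
        # Every treatment appears equally often inside the block.
        labels = list(treatments) * (len(members) // len(treatments))
        rng.shuffle(labels)        # randomization is done separately per block
        plan[name] = dict(zip(members, labels))
    return plan


# Hypothetical blocks formed on the basis of age.
blocks = {"18-30 yr": [1, 2, 3, 4], "31-45 yr": [5, 6, 7, 8]}
for name, assignment in randomized_block(blocks, ["A", "B"], seed=7).items():
    print(name, assignment)
```

Because the shuffle is repeated for every block, the allocation in one block does not influence the allocation in another.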

Advantages:

1. It can produce more precise results than completely randomized design.

2. It can accommodate any number of treatments or replications.

3. Blocking produces more comparable (homogeneous) groups.

Limitations:
1. Statistical data analysis is relatively difficult.
2. A missing observation within a block increases the complexity of the study.

3. Repeated measures, cross-over and carry-over design:

It is essentially a randomized block design in which each subject serves as a block and each block is used to study all planned treatments. Since the study uses the same subject several times, it is called a repeated measures design.

This method may involve a single treatment or several treatments at different time points.

Administration of two or more treatments, one after the other, to the same group of patients is called a cross-over design.

Method: Complete randomization is used to randomize the order of treatments for each subject, independently of the other subjects.
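The randomization step described in the method can be sketched as follows. The function name `crossover_orders` and the treatment labels are assumptions for illustration. Washout periods between treatments are not modelled here.

```python
import random


def crossover_orders(subjects, treatments, seed=None):
    """Give every subject all treatments, in an independently randomized order."""
    rng = random.Random(seed)
    orders = {}
    for subject in subjects:
        order = list(treatments)
        rng.shuffle(order)         # a fresh shuffle for each subject
        orders[subject] = order
    return orders


# Hypothetical example: 4 subjects, each receiving treatments X, Y and Z.
for subject, order in crossover_orders([1, 2, 3, 4], ["X", "Y", "Z"], seed=3).items():
    print(subject, " -> ".join(order))
```

Every subject receives every treatment exactly once. Only the sequence differs from subject to subject.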

Advantages:

1. It is well suited to checking the effect of treatment over time, because the same subject is used for all treatments.

2. It minimizes experimental error, since the same subject is used for the complete study.

Limitation:

It can produce a carry-over effect, which may lead to false results. The carry-over effect is the presence of a residual of the previous treatment in the subject at the time of the current treatment.

This problem can be overcome by allowing enough time for the drug to wash out of the body.

4. Latin square designs:

In this method each subject receives each treatment during the experiment.

It is a two-factor design, i.e. subject and treatment, with one observation in each cell.

This design minimizes the problem of carry-over effect by providing enough time for washout of the previous treatment.

In this method rows represent subjects, columns represent study periods, and the letter in each cell represents the treatment given.

An n × n Latin square is a square with n rows and n columns such that each of the n² cells contains one of n letters representing the treatments, and each letter appears exactly once in every row and every column.

As an example, consider a 3 × 3 Latin square design in which 3 subjects are used to compare 3 different tablets, each containing a different grade of an excipient that affects the disintegration time of the tablet. The design consists of 3 × 3 = 9 experiments arranged in 3 rows and 3 columns. The treatments are represented as X, Y and Z. The following is a particular 3 × 3 Latin square.

Subject No.   Study Period I   Washing period   Study Period II   Washing period   Study Period III   Washing period
1             X                                 Y                                  Z
2             Y                                 Z                                  X
3             Z                                 X                                  Y

If the first row and the first column contain the n letters in alphabetical order, the Latin square is called a standard square.

This design is used mainly in the pharmaceutical field for bioequivalence studies, as well as in agriculture.
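The defining properties of a Latin square and of a standard square can be checked mechanically. The sketch below is only an illustration. The cyclic construction is one simple way of building a Latin square and is an assumption of this example, not a method prescribed by the text. For the letters X, Y and Z it reproduces the 3 × 3 square shown in the table.

```python
def cyclic_latin_square(letters):
    """Build an n x n Latin square by shifting the letters one place per row."""
    n = len(letters)
    return [[letters[(row + col) % n] for col in range(n)] for row in range(n)]


def is_latin_square(square):
    """True if every letter appears exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    if len(symbols) != n:
        return False
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return rows_ok and cols_ok


def is_standard_square(square):
    """True if the first row and first column are in alphabetical order."""
    first_row = square[0]
    first_col = [row[0] for row in square]
    return first_row == sorted(first_row) and first_col == sorted(first_col)


square = cyclic_latin_square(["X", "Y", "Z"])
for subject, row in enumerate(square, start=1):
    print(subject, " ".join(row))
print(is_latin_square(square), is_standard_square(square))
```

In an actual study the rows of such a square would normally be assigned to subjects at random rather than in the fixed order printed here.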

Advantages:

1. It minimizes carry-over effects.

2. It minimizes the inter-subject variability.

3. It makes it possible to study formulation variables, which are the most important part of a bioequivalence study.

Limitations:
1. The study takes more time because of the washing periods.
2. The patient dropout rate is high.
3. Randomization is somewhat difficult.

Sample size for study:

For an effective and economical clinical trial, it is essential that the sample size be optimum.

For calculation of sample size, responses can be classified into two classes:
1. Binomial response: e.g. success or failure
2. Continuous response: e.g. measurement of blood pressure

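1. Binomial response:

$$2N = \frac{2\left[Z_\alpha\sqrt{2p(1-p)} + Z_\beta\sqrt{P_c(1-P_c) + P_d(1-P_d)}\right]^2}{(P_c - P_d)^2}$$

Where:
2N: Total number of subjects = Nd + Nc
Nd: Number of subjects in the drug group
Nc: Number of subjects in the control group
Pc: Success rate in the control group
Pd: Success rate in the drug group
Zα: 1.645 (one-sided significance level of 0.05)
Zβ: 1.282 (power of 90%)
p: pooled success rate, given by

$$p = \frac{R_d + R_c}{N_d + N_c}$$

where Rd and Rc are the numbers of successes in the drug and control groups respectively.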
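A small worked sketch of the binomial-response calculation follows. The function name and the example success rates (40% in the control group, 60% in the drug group) are hypothetical. The sketch also assumes equal group sizes, so that at the planning stage the pooled rate p can be taken as the average of Pc and Pd.

```python
import math


def sample_size_binomial(p_control, p_drug, z_alpha=1.645, z_beta=1.282):
    """Return (subjects per group, total subjects) for a binomial response."""
    # Assumption: equal group sizes, so the pooled rate is the simple average.
    p_bar = (p_control + p_drug) / 2
    bracket = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_control * (1 - p_control) + p_drug * (1 - p_drug))
    )
    two_n = 2 * bracket ** 2 / (p_control - p_drug) ** 2
    per_group = math.ceil(two_n / 2)   # round up to whole subjects per group
    return per_group, 2 * per_group


# Hypothetical planning values: 40% success on control, 60% on drug.
per_group, total = sample_size_binomial(0.40, 0.60)
print(per_group, total)
```

With these illustrative inputs the formula gives about 105.2 subjects per group, which rounds up to 106 per group and 212 in total. Rounding is applied per group so that both groups stay equal in size.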

2. Continuous response:

$$2N = \frac{4\sigma^2 (Z_\alpha + Z_\beta)^2}{\delta^2}$$

Where:
δ: the true difference to be detected between the control and drug groups
Zα and Zβ: 1.96 and 1.282 respectively
σ: standard deviation (σ² is the variance)
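The continuous-response formula can be sketched in the same way. The function name and the example values (a standard deviation of 10 mmHg and a difference of 5 mmHg to be detected in blood pressure) are hypothetical and chosen only to show the arithmetic.

```python
import math


def sample_size_continuous(sigma, delta, z_alpha=1.96, z_beta=1.282):
    """Return (subjects per group, total subjects) for a continuous response.

    sigma: standard deviation of the response (same units as delta).
    delta: true difference to be detected between control and drug groups.
    """
    two_n = 4 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    per_group = math.ceil(two_n / 2)   # round up to whole subjects per group
    return per_group, 2 * per_group


# Hypothetical planning values for a blood pressure study (mmHg).
per_group, total = sample_size_continuous(sigma=10, delta=5)
print(per_group, total)
```

Here 2N = 4 × 100 × (3.242)² / 25 ≈ 168.2, i.e. about 84.1 per group, which rounds up to 85 per group and 170 in total. Halving δ would roughly quadruple the required sample size, because δ enters the formula as a square.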

Exercise

1. Define experimental design. Give its significance.
2. Enlist the steps involved in experimental design.
3. Explain in detail the various types of experimental designs used in clinical trial studies.
4. Add a note on randomization, replication and local control.
5. Give the formulae for calculation of sample size for clinical trials.

12. Application of Biostatistics in Pharmacy

What is Statistics?

Statistics is a branch of mathematics which deals with the collection, organization, analysis, interpretation and presentation of data obtained from experiments or surveys.

Initially, the use of statistics was restricted to the collection of information related to health, population, property etc. by the respective governing body. Later, developments in the field made statistics an essential part of many disciplines. Nowadays almost all fields involve applications of statistics, including pharmacy, medicine, political science, commerce, business, media and many more. Moreover, every person uses statistics in daily life, although he or she may not be aware that the practice is called statistics.

Progress in computing has made the use of statistics simpler still. With computers and newly developed software, one can handle huge volumes of numerical data and obtain results in a fraction of a minute. Even with advanced Android phones, people can apply statistics anywhere for data analysis and presentation.

What is biostatistics?

Biostatistics is the term used when the tools of statistics are applied to data obtained from biological areas. In other words, biostatistics is the application of statistics to a wide range of topics in biology.

Biostatistics involves the collection, summarization, analysis, interpretation and presentation of data from various biological experiments, especially in medicine, pharmacy, agriculture and fishery. A major branch is medical biostatistics, which is exclusively concerned with medicine and health.

In medicine, whether it is research, diagnosis or treatment, everything depends on measurement or counting. For example, saying that a tablet disintegrates rapidly or slowly has no meaning unless it is expressed in figures. Therefore, biostatistics is also called quantitative medicine.

Applications of Biostatistics:

1. In research and development in the pharmaceutical industry:

Biostatistics plays a crucial role in research and development in the pharmaceutical industry. Drug discovery and development, development of new formulations of existing drugs and development of generic products all involve applications of biostatistics.

It takes about 10-15 years to develop one new medicine, from the time it is discovered to the time it is available for treating patients. The average cost to research and develop each successful drug is estimated to be $800 million to $1 billion. The overall process involves discovery of drug molecules, screening of drug molecules, preclinical study, submission of an IND (Investigational New Drug Application), clinical trials and submission of an NDA (New Drug Application) for approval to market the drug. Every stage from discovery to submission of the NDA involves application of biostatistics: experimental design, stating hypotheses, checking probability, selection of sampling techniques, collection of data through different experiments, arrangement of collected data, and analysis and interpretation of results. The investigator needs prior permission from the FDA for both preclinical and clinical trials, for which he or she has to submit data proving the safety and efficacy of the drug for further study. The FDA monitors all generated data very carefully and permits the investigator to proceed to the next stage only if satisfied. All these data are numerical and involve different calculations and representation of the generated data in a specific manner, so that appropriate conclusions can be drawn. As already stated, the process involves a huge investment of money as well as time, so any error in the results can cause the investigator a tremendous loss. Such errors can be avoided or minimized by application of biostatistics.

Similarly, in the development of generic products, the applicant has to prove bioequivalence of the developed product with the innovator's product and submit the data to the FDA in an ANDA (Abbreviated New Drug Application), which again involves application of biostatistics.

2. In anatomy and physiology study:

i) Biostatistics is used to study various physiological and anatomical parameters and their correlation with health, e.g. mean pulse rate, average glucose level, and mean and variance of weight and height. For example, the mean height of boys in Maharashtra is less than that in Punjab; whether this difference is due to natural variation or to a difference in nutrition can be studied with the application of statistics.

ii) Biostatistics is also used to study the normal, healthy population and to set limits for abnormality.

3. In pharmacology:

i) In pharmacology, biostatistics plays an important role in determining the action of a drug on animals or humans.

ii) It is also used to compare two different drugs, two different formulations of the same drug, or two identical dosage forms from different manufacturers.

4. In medicine:

i) Biostatistics is used to find the relation between two factors, such as T.B. and smoking.

ii) It can be used to compare the efficacy of drugs.

iii) Signs and symptoms of a disease or syndrome are identified by statistical study. E.g. in typhoid, fever is observed in almost all cases, while cough is rare.

5. In community medicine and public health:

i) Biostatistics is used in community medicine to find the usefulness of sera and vaccines, e.g. by comparison between vaccinated and unvaccinated groups.

ii) It is used in epidemiological studies to find the role of causative factors, e.g. deficiency of calcium in osteoporosis.

iii) In public health, it can be used to check the effectiveness of preventive measures. For example, a fall in the death rate may be the result of the availability of modern facilities in hospitals, advances in medicine or increased public awareness.

Exercise

1. Define statistics and biostatistics. Why is biostatistics called quantitative medicine?

2. Explain the applications of biostatistics in pharmacy in detail.
