Biostat Lecture Note


3/1/2010


Biostatistics

Samson G/Medhin, MPH

Course Objective

General Objectives:

• To acquaint students with the basic and intermediate statistical concepts and tools for collecting, analyzing, presenting and drawing conclusions from data.

Course Objective

Specific objectives: At the end of the course students will be able to:
• Describe the scope and application of statistics;
• Identify the types of variables and scales of measurement;
• Describe data with appropriate diagrammatic and numeric summary techniques;
• Understand the basic rules of probability and their statistical application in health sciences;
• Comprehend different sources of health and demographic data and appreciate their respective advantages and disadvantages;

Course Objective Cont…
• Understand the basic sampling techniques;
• Calculate optimal sample size for different types of studies;
• Calculate and interpret confidence intervals;
• Carry out hypothesis testing about different statistical parameters;
• Understand and apply intermediate statistical methods including correlation, linear regression, logistic regression and ANOVA;
• Carry out exploratory data analysis using SPSS;
• Understand and interpret statements in published articles pertaining to statistics.

Time Schedule
• Time Schedule.doc

Mode of Evaluation
• Mid 35%
• Final 40%
• Assignments/Quiz 10%
• Term paper 15%

References
1. M. Pagano and K. Gauvreau. Principles of Biostatistics, 2nd ed., Duxbury Thomson Learning, 2000.
2. T. Colton. Statistics in Medicine, Lippincott Williams & Wilkins Publisher, 1974.
3. B. Rosner. Fundamentals of Biostatistics, 6th ed., Thomson Books, 2006.
4. M. Bland. An Introduction to Medical Statistics, 5th ed., Oxford Medical Publications, 1993.
5. W. Daniel. Biostatistics: A Foundation for Analysis in Health Sciences, 8th ed., John Wiley and Sons Inc, 2005.
6. S. Landau and B. S. Everitt. Handbook of Statistical Analyses using SPSS, Chapman & Hall/CRC, 2004.

Introduction




What is Statistics?
• Statistics is a field of study concerned with the collection, organization and summarization of data, and the drawing of inferences about a body of data when only part of the data is observed.
• It is concerned with:
 – Designing experiments and data collection,
 – Summarizing information to aid understanding,
 – Drawing conclusions from data,
 – Estimating the present and predicting the future based on statistical evidence.

What is Statistics?
• Mathematical statistics: Concerned with the development of new methods of statistical inference; it requires detailed knowledge of abstract mathematics.
• Applied statistics: Involves applying the methods of mathematical statistics to specific subject areas.
• Biostatistics is the application of statistical methods to biological phenomena.

What is Statistics Cont…

• In clinical medicine and public health, statistics can be applied to:
 – Determine the accuracy of measurement,
 – Compare measurement techniques,
 – Assess diagnostic tests,
 – Determine normal values,
 – Estimate prognosis,
 – Compare the efficacy of treatment techniques,
 – Determine the prevalence of an event,
 – Identify determinants of health problems,
 – Compute adequate sample size for studies,
 – Etc.

Statistical Data

• Refers to the numerical description of things in the form of counts or measurements.
• Though statistical data always involve numeric description, not all numeric descriptions are statistical data.
• Statistical data should have the following characteristics:
 – They must be in aggregate,
 – They must be affected to a marked extent by multiple causes,
 – They must be collected in a systematic manner,
 – They must be estimated with reasonable accuracy,
 – They must be placed in relation to each other.




Classification of Statistics
• Descriptive Statistics: The methodology of effectively collecting, organizing and describing data.
• Inferential Statistics: Includes:
 – Inductive Statistics: The process of drawing conclusions about unknown characteristics of a population, based on a study of a sample.
 – Predictive Statistics: The process of predicting the future based on historical data.

Classification Cont..
• During analysis, based on the underlying assumptions, statistical methods can be classified as:
• Parametric statistics: A branch of statistics that assumes the data come from a particular type of probability distribution and makes inferences about the data based on that distribution.
• Non-parametric statistics: Interpretation does not depend on the population fitting any particular distribution.

Rationale of Studying Statistics

• It enables us to organize information in a formal manner.
• Issues in science are becoming more and more quantitative.
• Statistics is extensively used in the medical literature.
• The planning, conducting and implementing of medical and public health research are highly reliant on statistical methods.
• There is a great deal of intrinsic variation in most biological processes.

Possible Limitations of Statistics

• It mainly deals with variables which can be quantified.
• It deals with aggregates of facts; it may not give information about individuals.
• It is highly reliant on cutoff points.
• Analysis is done based on multiple assumptions.
• Errors are possible in statistical decisions.




Types of Variables
• A variable is any characteristic of a study unit (for example, an individual) that is measurable and/or classifiable, and that can take different values for different units.
• Depending on their quantifiability, variables can be classified as Qualitative or Quantitative.
• Qualitative (Categorical) Variable: a characteristic which cannot be measured in quantitative form but can be identified by names or categories. Examples: religion, ethnicity, illness status (well or ill), treatment outcome (improved or not improved), stage of breast cancer (I, II, III, IV), etc.

Types of Variables Cont…
• Quantitative Variable: a characteristic that can be measured and expressed numerically.
• This can be of two types:
• Discrete Quantitative Variable:
 – Can only take distinct, countable values (usually whole numbers).
 – Example: number of children, number of episodes of illness.
• Continuous Quantitative Variable:
 – Measured on a continuous scale.
 – Can assume an infinite number of values between any two given values.
 – Example: height, weight, age, blood sugar level.

Scale of Measurement

• In clinical medicine and public health, as in many other areas of science, we typically assign numbers to various attributes of people, objects, or concepts.
• This process is known as measurement.
• The process of measurement involves assigning numbers to observations according to rules.
• The way that the numbers are assigned determines the scale of measurement.
• Four scales of measurement are typically discussed: nominal, ordinal, interval and ratio.

Scale of Measurement Cont…

Nominal Scale:
• Is the lowest scale of measurement.
• Numbers are assigned to categories arbitrarily, as "names".
• Therefore, the only number property of the nominal scale of measurement is “identity”.
• For example, classifying people according to gender is a common application of a nominal scale. We may assign number "1" to "male" and number "2" to "female", or the opposite. The only mathematical operation we can perform with nominal data is to count.

Scale of Measurement Cont…

Ordinal Scale:
• Ordinal scale has the property of magnitude.
• It assigns each measurement to one of a limited number of categories that are ranked in graded order.
• However, the interval between the categories is not necessarily equal.
• Example: cancer stage, rank in a race.

Scale of Measurement Cont…

Interval Scale:
• Interval scale has the property of equal intervals between values.
• It doesn’t have a true zero point; the number "0" is arbitrary.
• Consequently, the ratio between two values on an interval scale doesn’t have a meaningful interpretation.
• E.g.: in measuring temperature on the °C scale, we can always be confident that the distance between 25 °C and 35 °C is the same as the distance between 65 °C and 75 °C.
• However, 0 °C doesn’t mean there is no temperature. Similarly, it would be inappropriate to say that 60 °C is twice as hot as 30 °C.
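The point about ratios on an interval scale can be checked numerically. The sketch below is only an illustration: it converts the two Celsius readings to kelvin (a temperature scale with a true zero, which is not mentioned in the slides and is brought in here as an assumption) and compares the ratios.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to kelvin, a scale with a true zero."""
    return celsius + 273.15

# Differences are meaningful on an interval scale: both gaps are 10 degrees.
gap_low = 35 - 25
gap_high = 75 - 65

# The ratio of the raw Celsius readings suggests "twice as hot" ...
naive_ratio = 60 / 30

# ... but on a scale with a true zero the ratio is nowhere near 2.
kelvin_ratio = celsius_to_kelvin(60) / celsius_to_kelvin(30)

print(gap_low == gap_high)      # equal intervals
print(naive_ratio)              # ratio of Celsius readings
print(round(kelvin_ratio, 3))   # ratio of the same temperatures in kelvin
```

The Celsius ratio depends on where the arbitrary zero was placed, which is why it has no physical interpretation.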

Scale of Measurement Cont…

Ratio Scale:
• Ratio scale of measurement has the properties of equal intervals between values and an absolute/true zero.
• These properties allow us to apply all mathematical operations (addition, subtraction, multiplication, and division) in data analysis.
• The absolute/true zero allows us to know how many times greater one case is than another.
• Example: height, weight.

Data Collection Method

• In order to generate valid conclusions from data, information has to be collected in a systematic manner.
• A haphazardly collected dataset is less likely to produce valuable and generalizable information.
• Data may be derived from several sources.
• Depending on the source, data can be classified as Primary or Secondary.
• Primary data are gathered for the first time by the researcher for a given purpose, while
• Secondary data are data already collected by others, for purposes other than the research question at hand.

Data Collection Method Cont…

Survey through interview:
• A quantitative approach in which a standardized questionnaire, administered through interview, is used to collect information.
• Advantage:
 – Quick and inexpensive,
 – Responses from different respondents are comparable,
 – Easy to quantify and analyze,
 – Useful in describing quantifiable characteristics of a large population,
 – Very large and representative samples are feasible,
 – Standardized questions make measurement more precise,
 – Participants do not need to be able to read and write to respond.
• Disadvantage:
 – Doesn’t give qualitative information,
 – Doesn’t give opportunity to probe and explore,
 – Relatively inflexible,
 – Less reliable for assessing the behavior and attitudes of respondents.


Survey through self-administered questionnaire:
• A quantitative method in which a standardized questionnaire, filled in by the respondents themselves, is used.
• Advantage:
 – Quick and inexpensive,
 – Responses from different respondents are comparable,
 – Useful in describing quantifiable characteristics of a large population,
 – Very large and representative samples are feasible,
 – Standardized questions make measurement more precise.
• Disadvantage:
 – Participants need to be able to read and write to respond,
 – High non-response rate,
 – Doesn’t give qualitative information,
 – Doesn’t give opportunity to probe and explore,
 – Less reliable for assessing the behavior and attitudes of respondents,
 – Relatively inflexible.


Secondary data:
• A quantitative approach which utilizes data already collected by others.
• Advantage:
 – Less resource- and time-consuming.
• Disadvantage:
 – May not give in-depth information,
 – No knowledge of the accuracy of data collection,
 – Can be outdated,
 – Limited control over the sampling method and size,
 – Less likely to give qualitative information.


Focus Group Discussion (FGD):
• A qualitative method to obtain in-depth information on concepts and perceptions about a certain topic through spontaneous group discussion of approximately 6–12 persons, guided by a facilitator.
• Advantage:
 – Excellent approach to gather in-depth information on the attitudes and beliefs of a group,
 – Group dynamics might generate more ideas than individual interviews,
 – Provides an excellent opportunity to probe and explore,
 – Participants are not required to read or write,
 – Unearths sensitive issues which are not commonly raised by individuals,
 – Facilitates the exploration of collective memories.
• Disadvantage:
 – Requires a strong facilitator to guide the discussion and ensure participation by all members,
 – Doesn’t give quantitative information,
 – It is difficult to organize the discussion,
 – Analysis is relatively difficult.


In-depth interview:
• A qualitative method that relies on person-to-person discussion.
• Advantage:
 – Good approach to gather in-depth attitudes and beliefs from individual respondents,
 – Provides an excellent opportunity to probe and explore,
 – Participants don’t need to be able to read and write to respond,
 – Assures privacy.
• Disadvantage:
 – Doesn’t give quantitative information,
 – It is time-consuming,
 – The respondent may feel like ‘a bug under a microscope’,
 – The analysis is relatively difficult.


Observation:
• A qualitative method that involves critical observation and recording of the practices (behavior, culture…) of individuals or a group.
• Excellent approach to discover behaviors,
• Usually takes a longer time,
• Liable to “observational bias”.

Designing Questionnaire

• Most of the data collection techniques utilize questionnaires.
• Hence, the quality of the data depends on how well the questionnaire is designed.
• There are two main objectives in designing a questionnaire:
 – To obtain accurate and relevant information for the study,
 – To maximize the response rate.

Designing Questionnaire Cont…

• A questionnaire can be classified on different bases:
• Structured Vs Non-structured Questionnaire:
 – The structured one is mainly designed for surveys.
 – A series of questions is arranged in a logical order and sequence and divided into subtopics.
 – Skip patterns are important in a structured questionnaire.
 – The data collector is expected to go smoothly through the sequence.
 – The non-structured one is commonly used for qualitative studies.
 – It doesn’t have a strict sequence of questions.
 – The data collector may rearrange the questions depending on the response of the subject.


• Open-ended Vs Closed-ended Questionnaire (Question):
• Open-ended questions permit a free response that should be recorded in the respondent’s own words.
• They allow exploration of the range of possible themes.
• Closed-ended questions offer a list of possible options or answers from which the respondents must choose.
• They are relatively easy and quick to fill, code, analyze and report.


Standardized Vs Non-standardized Questionnaire:
• A standardized questionnaire is developed by a well-known body and considered to be the “standard” for assessing a given research question.
• A non-standardized one is developed by the researcher to address the research question.
• What are the advantages and disadvantages of using a standardized questionnaire?

Steps in Designing a Questionnaire

1. Developing Individual Questions:
 – Use short and simple sentences.
 – Ask for only one piece of information at a time.
 – Ask precise questions to address the objective of the study.
 – Give extra attention to sensitive questions.
 – Avoid leading questions.
2. Format of responses: Questions should be formatted into open or closed formats depending on the need.

Steps Cont…

3. Arranging the Questions:
• Go from general to particular.
• Go from easy to difficult.
• Go from factual to abstract.
• Start with closed questions.
• Start with demographic and personal questions.
4. Piloting and Evaluation of Questionnaire:
• Given the complexity of designing a questionnaire, it is nearly impossible, even for experts, to get it right the first time round.
• Questionnaires must be pretested (piloted) on a small sample of people similar to those in the survey.

Diagrammatic Summarization

Introduction
• Data collection yields a set of data called Raw Data.
• The size of the data can range from a few hundred to many thousands of observations.
• Raw data, however, will not necessarily provide information that can easily be interpreted.
• Data presentation is a mechanism which enables easier understanding of a given set of data through the use of tables and graphs.
• In data summarization some of the detail in the data is lost, but this is compensated by a gain in understanding of the data.

Tables

• The simplest means of data presentation, which can be used for all types of data.

Frequency Distribution

• One type of information that is commonly used to organize data in tables is the Frequency Distribution.
• For nominal or ordinal data, the frequency distribution consists of a set of categories along with the numeric counts that correspond to each one.
• Example:

Tables Cont…

Table 2.1: Ethnic Composition of Women of Reproductive Age in Awassa Town, Jan 2006.

Ethnic Group    Frequency
Wolita          377
Amhara          355
Sidama          163
Oromo           144
Guragae         138
Kenbata         82
Tigray          47
Hadya           20
Others          50
Total           1376

Tables Cont…
• In displaying numeric data using a frequency distribution we should note the following:
• The range of values must be broken down into a series of distinct and non-overlapping intervals.
• The intervals should cover all data points.
• Intervals are often constructed, though not necessarily, so that all have equal width. This facilitates comparison among classes.
• Open-ended intervals should be avoided.
• The limits for each class must agree with the accuracy of the raw data.

Tables Cont…
• An appropriate number of intervals should be chosen: too many intervals are not very informative, and too few intervals lose a great deal of information.
• A rule of thumb is that the number of classes should be between 10 and 20.
• When we don’t have any evidence to decide the number of classes, we can use Sturges’ Formula:
• No of classes = 1 + [3.322 × log10(no of observations)]
• The width of each class can then be calculated as:
• Width of the class = (Max value − Min value) / No of classes
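A minimal sketch of Sturges’ formula and the class-width calculation follows. The sample size of 1376 is the total from Table 2.1; the age range 15 to 49 and the choice to round the number of classes up are assumptions made only for illustration.

```python
import math

def sturges_classes(n_observations):
    """Number of classes suggested by Sturges' formula: 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n_observations)

def class_width(min_value, max_value, n_classes):
    """Width of each class: (max - min) / number of classes."""
    return (max_value - min_value) / n_classes

n = 1376                            # observations (total of Table 2.1)
k_raw = sturges_classes(n)          # roughly 11.4
k = math.ceil(k_raw)                # rounding up is an assumption, not a rule from the slides
width = class_width(15, 49, k)      # illustrative minimum and maximum ages

print(round(k_raw, 2), k, round(width, 2))
```

In practice the computed width is usually rounded to a convenient value (for example 5 years) so that the class limits are easy to read.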

Tables Cont…

Relative and Cumulative Frequency
• In addition to counts, it is useful to know the proportion of values that fall into a given class.
• Relative frequency of a class is the proportion or percentage of the total number of observations that fall in a given class.
• Cumulative relative frequency of a class is the proportion (percentage) of the total number of observations that have a value less than or equal to the upper limit of a given interval.
• If such information is given in the form of counts, it is simply called Cumulative frequency.

Tables Cont…

Table 2.2: Cumulative and Relative Frequency of Age Structure of Women of Reproductive Age in Awassa Town, Jan 2006.

Age Group   Number of women   Relative Frequency (%)   Cumulative Relative Frequency (%)
15-19       399               28.9                     28.9
20-24       341               24.7                     53.6
25-29       281               20.4                     74.0
30-34       143               10.4                     84.3
35-39       116               8.4                      92.8
40-44       54                3.9                      96.7
45-49       42                3.0                      100.0
Total       1380              100.0
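The relative and cumulative relative frequencies can be computed directly from the counts, as in the sketch below. It uses the seven counts listed in Table 2.2; these sum to 1376, whereas the table's total row shows 1380, so the computed percentages differ slightly from the tabulated ones.

```python
from itertools import accumulate

age_groups = ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49"]
counts = [399, 341, 281, 143, 116, 54, 42]

total = sum(counts)

# Relative frequency: share of all observations falling in each class (%).
relative = [100 * c / total for c in counts]

# Cumulative frequency: running total of the counts.
cumulative_counts = list(accumulate(counts))

# Cumulative relative frequency: running total as a share of all observations (%).
cumulative_relative = [100 * c / total for c in cumulative_counts]

for row in zip(age_groups, counts, relative, cumulative_relative):
    print("{:<6} {:>5} {:>6.1f} {:>6.1f}".format(*row))
```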




Tables Cont…
• Depending on the number of variables represented, tables can be classified as one-way, two-way and higher order tables.
• One-way Table: Only one variable is summarized in the table.
• Two-way Table (Cross tabulation): Two variables are organized simultaneously, in a combined manner, in a table.
• Higher Order Table: Three or more variables are presented simultaneously in a table. The higher the order of the table, the more complicated the interpretation.

Tables Cont…

Educational status of women    Child Ever Born
                               >=5     <5
Illiterates                    42      68
Read and Write                 9       19
1st-4th grade                  32      60
5th-8th grade                  46      211
9th-12th grade                 42      239
>12th grade                    7       68
Total                          175     665

What type of table is this?

Tables Cont…

Child’s Age   Child’s Sex   History of illness in the preceding 2 weeks
                            Yes    No     Total
0-11 mo       Male          15     86     101
              Female        18     84     102
12-23 mo      Male          13     80     93
              Female        12     78     90
24-35 mo      Male          10     76     86
              Female        11     77     88
36-47 mo      Male          9      74     83
              Female        9      73     82
48-59 mo      Male          6      69     75
              Female        7      70     77

Tables Cont…

• In constructing tables, the following standards should be followed:
 – Tables should be simple and self-explanatory,
 – Every table should have a title (usually at the top of the table) which indicates the who, what, when and where of the data presented,
 – Rows and columns should be labeled,
 – Totals should be indicated,
 – Numeric entries of zero should be written as “0”, while missing or unobserved data should be represented by “-”,
 – If the data are not original, their source should be given as a footnote,
 – Complicated tables should be avoided.

Diagrammatic Representation

• A second way to present data is through the use of graphs or pictures (Diagrammatic Representations).
• Though diagrammatic representations are easier to read than tables, they supply a lesser degree of detail.
• However, the lesser detail can be compensated by a gain in understanding of the data.
• Diagrammatic representation has the following advantages:
 – They are easier to understand and memorize,
 – They are more attractive,
 – They facilitate comparison among groups,
 – They may show patterns within the data set.

Bar Charts (Bar Graphs)

• Bar graphs are a popular type of graph used to display a frequency distribution for nominal or ordinal data.
• In the case of the commonest type, the Vertical Bar Graph (Column Graph), the various categories into which the observations fall are presented along the horizontal axis.
• A vertical bar is drawn above each category so that the height of the bar represents either the frequency or the relative frequency of observations within that class.
• The bars should have equal width and be separated from one another so as not to imply continuity.
• In the case of the Horizontal Bar Graph, the reverse holds: the categories are presented along the vertical axis and the length of each bar represents the frequency.

Bar Charts Cont…

Bar graph has different types:
• Simple Bar Graph:
 – Depicts the frequency/relative frequency of classes of a variable.
 – The intention is to compare the frequency of different classes of a variable.

[Figure: simple bar graph. X axis: The time breastfeeding was initiated (Within an hr, 1-24 hr, After the first day); Y axis: Percentage of children aged 0-11 months, scale 0 to 70.]

Bar Charts Cont…
• Multiple Bar Graph:
 – Depicts the frequency or relative frequency of classes of a variable in two or more situations.
 – This type enables comparison between the levels of classes of the variable in different situations.

[Figure: multiple bar graph comparing Baseline and End line. X axis: The Time Breastfeeding was Initiated (Within an hr, Within a day, After the first day); Y axis: %, scale 0 to 70. Data labels shown: 28, 60, 26, 63.3, 33.5, 2.8.]

Bar Charts Cont…
• Component Bar Graph:
 – Similar to a simple bar graph, except that the bars are divided into components.
 – The graph shows the relative contribution of the components to the bar (category).

[Figure: component bar graph with Female and Male components. X axis: The time breastfeeding was initiated (Within an hr, 1-24 hr, After the first day); Y axis: Percentage of children aged 0-11 months, scale 0 to 70.]

Bar Charts Cont…
• 100% Component Bar Graph:
 – Similar to a component bar graph.
 – But the height of all the bars is set at 100% so that the relative contribution of the components can easily be compared.

[Figure: 100% component bar graph with Females and Males components. X axis: Within an hr, Within a day, After the first day; Y axis: 0% to 100%.]

Pie Chart

• A Pie Chart is a circular chart divided into sectors, illustrating the relative magnitudes or frequencies of classes of a given variable.
• A pie chart usually represents categorical data, but it is also possible to use it for discrete quantitative data.
• The angle of each sector has to be proportional to the relative frequency of the given class.
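Since a full circle is 360 degrees, the sector angle for a class is 360 multiplied by its relative frequency. The sketch below applies this to the ethnicity counts of Table 2.1; the variable names are illustrative only.

```python
# Frequencies from Table 2.1 (ethnic composition).
counts = {
    "Wolita": 377, "Amhara": 355, "Sidama": 163, "Oromo": 144,
    "Guragae": 138, "Kenbata": 82, "Tigray": 47, "Hadya": 20, "Others": 50,
}

total = sum(counts.values())

# Sector angle (degrees) = 360 * relative frequency of the class.
angles = {group: 360 * count / total for group, count in counts.items()}

for group, angle in angles.items():
    print(f"{group:<8} {angle:6.1f} degrees")
```

The angles necessarily add up to 360 degrees, which is a quick check that no class has been left out.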

Pie Chart Cont….


Histogram
• Whereas a bar chart represents a frequency distribution for either nominal or ordinal data, a Histogram depicts a frequency distribution for continuous data.
• The horizontal axis displays the true limits of the intervals; the vertical axis represents the frequency or relative frequency of each interval.
• If the intervals are of equal width, the frequency associated with each interval can be represented by the height of the respective bar.
• However, if the bars have different widths, the histogram should be drawn in such a way that the Y axis represents the frequency density and the X axis the interval.

Histogram Cont…
• Then the frequency of each interval is represented by the area of its bar.
• Frequency density of an interval = frequency of the interval / true class width.
• Unlike in a bar graph, the categories (bars) of a histogram must be adjacent. Hence, in order to construct a histogram, true class boundaries should be used rather than class limits.
• For example, the following table summarizes the Biostatistics mid exam scores (out of 35 marks) of 38 students.
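As a minimal sketch of the frequency-density rule for unequal class widths, the code below uses three illustrative classes. The boundaries and counts are assumptions chosen for the example; they are not the exam-score data.

```python
# Each class: (true lower boundary, true upper boundary, frequency).
# Illustrative values only.
classes = [(0.5, 5.5, 2), (5.5, 10.5, 12), (10.5, 20.5, 16)]

densities = []
for lower, upper, frequency in classes:
    width = upper - lower                 # true class width
    densities.append(frequency / width)   # frequency density = frequency / width

# The area of each bar (density * width) recovers the class frequency.
areas = [d * (upper - lower) for d, (lower, upper, _) in zip(densities, classes)]

print(densities)
print(areas)
```

The last class has the largest frequency but not the tallest bar, because its count is spread over a class that is twice as wide.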

Frequency Polygon

• A Frequency Polygon depicts a frequency distribution for continuous numeric data.
• Frequency polygons are a graphical device for understanding the shapes of distributions.
• A histogram can easily be changed to a frequency polygon by joining the midpoints of the tops of the adjacent rectangles of the histogram with a line.
• It is also possible to draw a frequency polygon without drawing a histogram. The procedure is as follows:



Frequency Polygon Cont…

1. Identify the midpoints of all the class intervals of the given data,
2. Plot the midpoints (on the X axis) against the respective frequency or relative frequency of each class (on the Y axis),
3. Connect adjacent points with a straight line.
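The first two steps above can be sketched as follows. The age classes and frequencies are hypothetical values invented for this illustration, and the midpoint is taken as the average of the lower and upper class limits.

```python
# Each class: (lower limit, upper limit, frequency). Hypothetical values.
classes = [
    (10, 19, 5), (20, 29, 19), (30, 39, 10), (40, 49, 13),
    (50, 59, 4), (60, 69, 4), (70, 79, 5),
]

# Step 1: midpoint of each class interval.
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]

# Step 2: the points to plot, as (midpoint, frequency) pairs.
points = [(mid, freq) for mid, (_, _, freq) in zip(midpoints, classes)]

# Step 3 would join consecutive points with straight line segments,
# for example with a plotting library of your choice.
for mid, freq in points:
    print(mid, freq)
```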


• For example, the following frequency distribution represents the ages (in years) of 60 patients at a psychiatric counseling centre.
• First we have to identify the midpoints of each interval.

Frequency Polygon Cont…
• Finally we have to plot the midpoints (on the X axis) against the respective frequency of each class (on the Y axis) and connect adjacent points with a straight line.

Scatter Plot (Scatter Graph)

• A scatter plot is used to show the relation between two different continuous measurements.
• The scale for one quantity is marked on the X axis and the scale for the other on the Y axis.
• Each point on the graph represents a pair of values for the two measurements.
• For each value on the X axis, it is possible to have multiple Y values.
• The following scatter plot shows the relation between age and blood glucose level among diabetic patients aged 50-70 years.

Scatter Plot Cont..

[Figure: scatter plot. X axis: Age in Years (50 to 70); Y axis: Blood Glucose level mg/dl (120 to 200).]

Line Graph

• A line graph is similar to a scatter plot in that it shows the relation between two different continuous measurements.
• Once again, each point on the graph represents a pair of values.
• However, unlike in a scatter plot, each value on the X axis has a single corresponding measurement on the Y axis.
• As the name indicates, points on the graph are connected to the adjacent points with straight lines.
• Most commonly the scale along the X axis represents time. Consequently we are able to trace chronological changes.

Line Graph Cont…

Figure 2.8: Mean Number of Child Ever Born to Women at the Age of 25 years, Awassa Town (1980-2005)

[Figure: line graph. X axis: Year (GC), 1980 to 2005; Y axis: Mean Child Ever Born among Women at the Age of 25, scale 1 to 3.75.]

Cumulative Line Graph

• Also known as an Ogive.
• It is best used when you want to display the running total at any given time.
• The relative slopes from point to point indicate greater or lesser increases.
• For example, a steeper slope means a greater increase than a more gradual slope.
• For example, if you saved $300 in both January and April and $100 in each of February, March, May, and June, the Ogive would look as follows.
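The values plotted on the Ogive for the savings example are the running totals of the monthly amounts. A short sketch, using only the figures given above:

```python
from itertools import accumulate

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
savings = [300, 100, 100, 300, 100, 100]   # amount saved in each month ($)

# Running total: the height of the Ogive at the end of each month.
cumulative = list(accumulate(savings))

for month, amount, total in zip(months, savings, cumulative):
    print(f"{month}: saved {amount}, cumulative {total}")
```

The steepest segments of the line end at January and April, the two months with the largest savings.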

Cumulative Line Graph Cont…

Box and Whisker Plot

• In descriptive statistics, a box-and-whisker plot is a convenient way of pictorially depicting groups of numerical data through their five-number summaries: the smallest observation, 1st quartile, median, 3rd quartile, and largest observation.

Box and Whisker Plot Cont…

• However, in some cases the ends of the whiskers can represent other values.
• For example, in SPSS:
 – The ends of the whiskers represent the lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile.
 – Values more than 3 IQRs from the end of the box are labeled as extreme and denoted with an asterisk (*). Values more than 1.5 IQRs but less than 3 IQRs from the end of the box are labeled as outliers (o).
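The whisker and outlier rules above can be sketched in a few lines. This is an illustration under stated assumptions, not a reproduction of SPSS: the quartiles are computed with Python's `statistics.quantiles` using the "inclusive" method, which may differ from the quartile definition SPSS uses, the data are hypothetical, and values lying exactly 3 IQRs from the box are counted as outliers.

```python
import statistics

def classify_points(values):
    """Return whisker ends, outliers and extremes using 1.5*IQR and 3*IQR fences."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1

    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    # Whiskers end at the most distant data points still inside the inner fences.
    inside = [v for v in values if inner_low <= v <= inner_high]
    whiskers = (min(inside), max(inside))

    # Beyond 3 IQR from the box: extreme values (*).
    extremes = [v for v in values if v < outer_low or v > outer_high]

    # Beyond 1.5 IQR but not beyond 3 IQR: outliers (o).
    outliers = [
        v for v in values
        if (outer_low <= v < inner_low) or (inner_high < v <= outer_high)
    ]
    return whiskers, outliers, extremes

data = [10, 12, 13, 14, 15, 16, 17, 18, 19, 30, 50]   # hypothetical values
whiskers, outliers, extremes = classify_points(data)
print(whiskers, outliers, extremes)
```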


Stem and Leaf Plot

• Is a display that organizes data to show its shape and distribution.
• Each data value is split into a "stem" and a "leaf" portion.
• The "leaf" is the last digit of the number, and the other digits to the left of the "leaf" form the "stem".
• For example, the number 42 would be split apart, with the stem becoming the 4 and the leaf becoming the 2.
• Consider the following dataset, sorted in ascending order: 8, 13, 16, 25, 26, 29, 30, 32, 37, 38, 40, 41, 44, 47, 49, 51, 54, 55, 58, 61, 63, 67, 75, 78, 82, 86, 95.

Stem and Leaf Plot Cont…

0 | 8
1 | 3 6
2 | 5 6 9
3 | 0 2 7 8
4 | 0 1 4 7 9
5 | 1 4 5 8
6 | 1 3 7
7 | 5 8
8 | 2 6
9 | 5
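The display can be built mechanically by splitting each value into its tens digit (stem) and units digit (leaf). The sketch below assumes non-negative whole numbers below 100, as in the dataset given for this example; the function name is illustrative.

```python
from collections import defaultdict

data = [8, 13, 16, 25, 26, 29, 30, 32, 37, 38, 40, 41, 44, 47, 49,
        51, 54, 55, 58, 61, 63, 67, 75, 78, 82, 86, 95]

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display (stem = tens, leaf = units)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    # Include every stem in the range so that empty stems still appear as gaps.
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}"
        for stem in range(min(stems), max(stems) + 1)
    ]

lines = stem_and_leaf(data)
print("\n".join(lines))
```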

Pictogram

• A pictogram is a graph which uses pictures or symbols to present data.
• It usually presents the frequency of one or more categorical or discrete numeric variables in the form of symbols.
• The magnitude of the frequency can be shown either by the size of the picture or by the number of pictures.
• For example, the following pictogram represents the number of passengers per year across four airports of the UK.

Pictogram Cont…


Issues to be considered in diagrammatic representation
• Depending on the type of the data, the right type of diagrammatic representation should be selected.
• It is not common to use two or more types of diagrammatic representation simultaneously for the same data. The most suitable one should be selected and used.
• Each graph and diagram should be labeled (usually the title is given below the figure).
• The title should indicate the “Who”, “What”, “When” and “Where” of the data presented.
• If the representation is taken from another source, the primary source should be indicated.

Issues to be considered Cont…
• In graphs, the X and Y axes should be indicated clearly with their units of measurement.
• In graphs, the scales of the X and Y axes should be drawn proportionally.
• Pictorial representations usually require a “Key” to facilitate easier interpretation.
• When colors are employed, contrasting colors should be selected.

Diagrammatic Representation Using SPSS

• In order to develop graphs using SPSS, the following steps should be followed:
• Graphs > Legacy Dialogs > select the appropriate graph
• Available types are Bar graph, Pie chart, Histogram, Line graph, Scatter plot and Box plot.
• Other rarely used types are also available.
• Most of the graphs can also be found under the “Analyze > Descriptive Statistics” menu.

Numeric Summarization


Introduction

• Even though diagrammatic representation greatly enhances understanding of the data, it does not give mathematically amenable outputs.
• This gap is addressed by numeric summarization.
• In summarizing a dataset using numeric indicators, we often focus on describing the data with two summary figures. These are:
 – Central Tendency (Location)
 – Variation (Spread)

Measures of Central Tendency

• One of the most commonly used measures to summarize a set of data is its center.
• The center is a value (usually a single value) chosen in such a way that it gives a reasonable approximation of the whole dataset.
• In statistics, the number which tends to approximate the center of a set of data is called a Measure of Central Tendency or Average.
• The Arithmetic Mean, Median and Mode are the most commonly used measures of central tendency.

Measures of Central Tendency Cont…

Attributes of a good measure of central tendency are:
• It should be based on all observations.
• It should not be affected by extreme values.
• It should have a definite value.
• It should not require complicated computation.
• It should be capable of further algebraic treatment.
• It should be close to the location where the majority of the observations are located.

Arithmetic Mean

• The Arithmetic Mean is usually called the Mean.
• It is the most familiar measure of central tendency.
• It is calculated by adding all of the individual values and dividing the sum by the number of individual values.
• In statistics, two separate symbols are used for the mean.
• The Greek letter µ (mu) is used to denote the population mean.
• The symbol $\bar{x}$ (read as "x bar") is used to denote the sample mean.

Arithmetic Mean Cont…

• When n is the total number of observations and $x_i$ is the value of X for the i-th observation, the formula for the arithmetic mean is:

$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} $$

• In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid point of the interval.
• The formula is:

$$ \bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{n} $$

Arithmetic Mean Cont…

Where:
– k is the number of class intervals,
– $m_i$ is the mid point of the i-th class interval,
– $f_i$ is the frequency of the i-th class interval,
– n is the total number of observations.

• The formula simply means that each value within an interval is represented by the mid point of the true class interval. Then we can calculate the mean as usual.

Arithmetic Mean Cont…

Example 3.1: Consider the time taken by 30 students to do a Biostatistics quiz.

Minutes spent on Quiz | Number of students (f) | True class interval | Mid point (m) | m_i f_i
1-5 | 2 | 0.5-5.5 | 3 | 6
6-10 | 12 | 5.5-10.5 | 8 | 96
11-20 | 16 | 10.5-20.5 | 15.5 | 248
Total | 30 | | | 350

Thus the mean of the data is 350/30 = 11.7 minutes.
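The grouped-data mean from Example 3.1 can be checked with a short Python sketch. The function name `grouped_mean` and the tuple layout are illustrative choices, not part of the lecture note.

```python
# Grouped-data mean: every class is represented by the mid point of its
# true class interval, weighted by the class frequency.
# Each tuple is (lower true limit, upper true limit, frequency) from Example 3.1.
classes = [(0.5, 5.5, 2), (5.5, 10.5, 12), (10.5, 20.5, 16)]


def grouped_mean(classes):
    """Return sum(m_i * f_i) / n for a list of (lower, upper, frequency)."""
    n = sum(f for _, _, f in classes)
    total = sum(((lo + hi) / 2) * f for lo, hi, f in classes)
    return total / n


print(round(grouped_mean(classes), 1))  # 11.7
```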

Arithmetic Mean Cont…

• The major advantages of the mean are:
– It is calculated based on all observations.
– Its mathematical computation is not complicated.
– It accommodates further mathematical applications.
– It can only have one value.

• The major disadvantages of the mean are:
– It is affected by extreme values.
– It is not a good summary when the distribution is markedly skewed (far from normally distributed).

Median

• The Median is the value which divides the data into two equal halves, with half of the values being lower than the median and half higher than the median.
• When n is the number of observations in a dataset, the median is calculated as follows:
– Sort the values into ascending order.
– If you have an odd number of observations, the median is the middle observation, i.e. the observation at position (n+1)/2.
– If you have an even number of observations, the median is the arithmetic mean of the two middle observations, i.e. pick the numbers at positions n/2 and (n/2) + 1 and find the mean of those two observations.

Median Cont…

Example 3.2: Compute the median for {1, 2, 3, 4, 5}
• The numbers are already sorted, so it is easy to see that the median is 3 (two numbers are less than 3 and two are bigger).

Example 3.3: Compute the median for {1, 2, 3, 4, 5, 6}
• The median is 3.5, since that is the midpoint between 3 and 4, computed as (3 + 4)/2.
• Note that three numbers are less than 3.5 and three are bigger, as the definition of the median requires.
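The positional rule for the median (middle value for odd n, mean of the two middle values for even n) can be sketched in Python as follows. The function name `median` is an illustrative choice; the two calls reproduce Examples 3.2 and 3.3.

```python
def median(values):
    """Median by position: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # odd n: position (n + 1) / 2, which is zero-based index n // 2
        return ordered[mid]
    # even n: positions n / 2 and n / 2 + 1 (zero-based mid - 1 and mid)
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 2, 3, 4, 5]))     # 3
print(median([1, 2, 3, 4, 5, 6]))  # 3.5
```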

Median Cont…

• When we are dealing with grouped data, the median can be calculated as:

$$ \tilde{X} = L_m + \left( \frac{\frac{n}{2} - F_c}{F_m} \right) w $$

• Where:
– $L_m$ is the lower true class boundary of the interval containing the median,
– $F_c$ is the cumulative frequency of the interval immediately preceding the median class interval,
– $F_m$ is the frequency of the interval containing the median,
– w is the class interval width,
– n is the total number of observations.

Median Cont…

• The major advantages of the median are:
– It is not affected by extreme values,
– It can be used in skewed distributions,
– It is easy to calculate,
– It can only have one value,
– It can be calculated when there is an open-ended interval.

• The major limitations of the median are:
– It may not be a good representative value if the number of observations is too small,
– It does not accommodate further mathematical applications (in parametric statistics),
– It is calculated based on one or two observations.

Mode

• Mode is by far the simplest, but the least widely used, measure of central tendency.
• It is simply the value that occurs most frequently.
• When the distribution has only one value with the highest frequency it is called Unimodal. If it has two values with equal and highest frequency it is called Bimodal. Similarly, it is possible to have a multimodal distribution.
• Example: {1, 2, 2, 3, 3, 4, 4, 4, 5}
• The mode is 4.
• In grouped data the mid point of the interval with the highest frequency is considered as the mode of the distribution.
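A minimal Python sketch of finding the mode, including the bimodal case, is given here. The function name `modes` and the second dataset are hypothetical and only serve the illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3, 3, 4, 4, 4, 5]))  # [4]     unimodal (slide example)
print(modes([1, 1, 2, 3, 3]))              # [1, 3]  bimodal (hypothetical data)
```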

Mode Cont…

For example, the following table displays the salary of 20 factory workers in factory X.

Salary in Br | Number of Factory Workers
500-600 | 3
600-700 | 6
700-800 | 5
800-900 | 5
900-1000 | 0
1000-1100 | 1

The interval 600-700 has the highest frequency, so the mid point of this interval, i.e. 650, is taken as the mode of the distribution.

Mode Cont…

• The major advantages of the mode are:
– It can be used when the variable is ordinal or nominal,
– It is very easy to compute,
– It is less likely to be affected by extreme values,
– It can be calculated for distributions with an open-ended class interval.

• The major disadvantages of the mode are:
– It may not properly reflect what central tendency implies,
– It does not accommodate further mathematical application,
– It is calculated based on few observations,
– It may have more than one value for a dataset,
– At times a mode may not exist in a dataset.

Skewness and the Measures of Central Tendency

• The normal distribution is one that is bell shaped, unimodal and symmetric.
• Skewness measures the asymmetry of a distribution.
• If the distribution is not symmetric (one side does not mirror the other), then it is skewed.
• Skewness is indicated by the “tail” or trailing frequencies of the distribution.
• If the tail is to the right, it is a positively skewed distribution. If the tail is to the left, it is a negatively skewed distribution.
• In a normal distribution, the mean, median and mode are equal.
• Skewness affects the arrangement of the three measures of central tendency in the following way.

Skewness and the Measures of Central Tendency Cont…

• Typically, in a positively skewed distribution mode < median < mean, and in a negatively skewed distribution mean < median < mode.

Weighted Mean

• The weighted mean is similar to an arithmetic mean, except that individual data values contribute unequally to the mean.
• Each data value ($x_i$) has a weight assigned to it ($w_i$).
• Data values with larger weights contribute more to the weighted mean and data values with smaller weights contribute less.
• The formula is:

$$ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} $$

Weighted Mean Cont…

• If all the weights are equal, then the weighted mean is the same as the arithmetic mean.
• The best example of the application of the weighted mean is the calculation of GPA.
• Each grade is weighted by the credit hours of its course, so a grade in a course with more credit hours contributes more to the GPA.
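As a rough illustration of the weighted mean in the GPA setting, the sketch below assumes that grade points are the data values and course credit hours are the weights. The grades and credit hours are hypothetical, and `weighted_mean` is an illustrative helper name.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical transcript: grade points (A = 4, B = 3, C = 2) and credit hours.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 2]

gpa = weighted_mean(grade_points, credit_hours)
print(round(gpa, 2))                             # 3.11
print(weighted_mean(grade_points, [1, 1, 1]))    # 3.0, equal weights give the arithmetic mean
```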


Geometric Mean

• The geometric mean is an average calculated by multiplying a set of n positive numbers and taking the n-th root of the product.
• The geometric mean is related to the log-normal distribution.
• The log-normal distribution is a distribution whose logarithm-transformed values are normally distributed.

Harmonic Mean

• The harmonic mean (H) of n positive values is defined by the formula:

$$ H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} $$

• It is the reciprocal of the arithmetic mean of the reciprocals.
• It is more appropriate than the arithmetic mean in situations involving rates.
• For example: A blood donor fills a 250 mL blood bag at 70 mL/min on the first visit and at 90 mL/min on the second visit. What is the average rate at which the donor fills a bag?
• Given:
– 250 mL at 70 mL/min = 3.571 min in total
– 250 mL at 90 mL/min = 2.778 min in total

Harmonic Mean Cont…

• So 500 mL in total in (3.571 + 2.778) min = 500/6.349 = 78.753 mL/min.
• The harmonic mean, 2/[1/70 + 1/90] = 78.750 mL/min, gives a more accurate description of the average rate than the arithmetic mean (80 mL/min).
• Source: http://wiki.answers.com/Q/What_is_the_application_of_harmonic_mean_in_medicine
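The blood-donor example can be reproduced with Python's standard `statistics` module. This is only a sketch that compares the harmonic mean with the total-volume-over-total-time rate; the variable names are illustrative.

```python
from statistics import harmonic_mean, mean

rates = [70, 90]   # mL/min on the first and second visit
bag_volume = 250   # mL filled at each visit

# Average rate from first principles: total volume / total time.
total_time = sum(bag_volume / r for r in rates)
overall_rate = bag_volume * len(rates) / total_time

print(round(overall_rate, 2))          # 78.75
print(round(harmonic_mean(rates), 2))  # 78.75
print(mean(rates))                     # 80, the arithmetic mean overstates the rate
```

The small difference on the slide (78.753 vs 78.750) comes from rounding the two filling times to three decimals before adding them.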

Measures of Dispersion

• While measures of central tendency are used to estimate the "center" value of a dataset, measures of dispersion describe the spread of the data, or its variation around a central value.
• Two distinct samples may have the same mean or median but completely different levels of variability, or vice versa.
– Set 1: 30, 40, 40, 50, 60, 60, 70 (Mean = 50)
– Set 2: 48, 49, 49, 50, 50, 51, 53 (Mean = 50)
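To make the contrast between Set 1 and Set 2 concrete, the sketch below computes the mean, the range and the sample standard deviation (n − 1 denominator) of both sets with the standard `statistics` module. The dictionary layout is an illustrative choice.

```python
from statistics import mean, stdev

set1 = [30, 40, 40, 50, 60, 60, 70]
set2 = [48, 49, 49, 50, 50, 51, 53]

summary = {}
for name, data in (("Set 1", set1), ("Set 2", set2)):
    summary[name] = {
        "mean": mean(data),
        "range": max(data) - min(data),   # largest minus smallest value
        "sd": stdev(data),                # sample SD, n - 1 denominator
    }
    print(name, summary[name])
```

Both sets share the mean of 50, but the sample standard deviation is roughly 14.1 for Set 1 and roughly 1.6 for Set 2.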

Range

• Defined as the difference between the largest and smallest sample values ($x_{max} - x_{min}$).
• Major advantage: It is simple to calculate.
• Major disadvantages:
– It depends only on the extreme values and provides no information about how the remaining data are distributed.
– Ranges cannot be compared when the units of measurement are different.
– The extreme values are the most unreliable parts of the data.
– It doesn’t accommodate further mathematical application.

Standard Deviation and Variance

• Standard deviation is the most common and useful measure of dispersion.
• It is, roughly, the average distance of each observation from the mean.
• The formula for the sample standard deviation is:

$$ S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$

• The formula for the population standard deviation is:

$$ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} $$

• What might be the reason for the difference?

Standard Deviation and Variance Cont…

• Variance is just the square of the standard deviation.
• The formulas for sample and population variance are as follows:

$$ S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$

$$ \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n} $$

• NB: Occasionally, the abbreviations SD for standard deviation and Var for variance are used.
• Standard deviation for grouped data is calculated as:

$$ S = \sqrt{\frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{k} f_i m_i^2 - n\bar{x}^2}{n - 1}} $$

Standard Deviation and Variance Cont…

• Advantages:
– They accommodate further mathematical applications.
– They are calculated from all the observations.

• Disadvantages:
– They must always be understood in the context of the mean of the data.
– They are expressed in the unit of measurement of the observed data (squared units for the variance). Thus it is difficult to compare the standard deviation/variance of two datasets measured in two different units.

Coefficient of Variation (CV)

• The standard formulation of the CV is the ratio of the standard deviation to the mean of a given dataset:

$$ CV = \frac{S}{\bar{x}} \times 100\% $$

• The coefficient of variation is a dimensionless number.
• So when comparing datasets with different units, one should use the CV instead of the SD.
• The CV is useful in comparing the variability of several different samples, each with a different arithmetic mean, as higher variability is expected when the mean increases.
• The CV is also important for comparing the reproducibility of variables.

Example on Grouped Data

Example 3.4:
• Consider the time taken by 30 students to do a Biostatistics quiz. Their time is summarized in the following table.

Minutes spent on Quiz | Number of students (f)
1-5 | 2
6-10 | 12
11-20 | 16
Total | 30

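Example Cont…

Minutes spent on Quiz | Number of students (f) | True class interval | Mid point (m) | f_i m_i | f_i m_i^2
1-5 | 2 | 0.5-5.5 | 3 | 6 | 18
6-10 | 12 | 5.5-10.5 | 8 | 96 | 768
11-20 | 16 | 10.5-20.5 | 15.5 | 248 | 3844
Total | 30 | | | 350 | 4630

$$ \bar{x} = \frac{\sum_{i=1}^{k} f_i m_i}{n} = \frac{350}{30} = 11.7 \text{ minutes} $$

The median class is 10.5-20.5 (it contains the 15th observation), so $L_m = 10.5$, $F_c = 14$, $F_m = 16$ and $w = 10$:

$$ \tilde{X} = L_m + \left( \frac{\frac{n}{2} - F_c}{F_m} \right) w = 10.5 + \left( \frac{15 - 14}{16} \right) \times 10 = 11.1 \text{ minutes} $$

$$ S = \sqrt{\frac{\sum_{i=1}^{k} f_i m_i^2 - n\bar{x}^2}{n - 1}} = \sqrt{\frac{4630 - 4083.3}{29}} = 4.34 \text{ minutes} $$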
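The three grouped-data summaries of the quiz example can be cross-checked with the Python sketch below. It assumes the class width of the median class itself (10 for the interval 10.5-20.5) and the n − 1 denominator applied to the squared deviations from the grouped mean. All names are illustrative.

```python
from math import sqrt

# (lower true limit, upper true limit, frequency) for the quiz data
classes = [(0.5, 5.5, 2), (5.5, 10.5, 12), (10.5, 20.5, 16)]

n = sum(f for _, _, f in classes)
mids = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

# Grouped mean: sum(f_i * m_i) / n
g_mean = sum(m * f for m, f in zip(mids, freqs)) / n

# Grouped median: locate the first class whose cumulative frequency reaches n/2
cumulative = 0
for lo, hi, f in classes:
    if cumulative + f >= n / 2:
        g_median = lo + ((n / 2 - cumulative) / f) * (hi - lo)
        break
    cumulative += f

# Grouped sample SD: sqrt(sum(f_i * (m_i - mean)^2) / (n - 1))
g_sd = sqrt(sum(f * (m - g_mean) ** 2 for m, f in zip(mids, freqs)) / (n - 1))

print(round(g_mean, 2), g_median, round(g_sd, 2))  # 11.67 11.125 4.34
```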

Measures of Position (Fractiles)

• In addition to measures of central tendency and dispersion, measures of position give additional information about a given dataset.
• Fractiles (Quantiles) are numbers that partition, or divide, an ordered dataset into equal parts.
• For instance, the median is a fractile because it divides an ordered dataset into two equal parts.
• The commonly used measures of position are Quartiles (that divide the data into 4 parts), Deciles (that divide the data into 10 parts), and Percentiles (that divide the data into 100 parts).

Quartiles

• Quartiles divide a dataset into four equal parts.
• The three quartiles Q1, Q2 and Q3 divide an ordered dataset into four equal parts.
– About ¼ of the data falls on or below the first quartile Q1.
– About ½ of the data falls on or below the second quartile Q2 (equivalent to the median).
– About ¾ of the data falls on or below the third quartile Q3.
– About ¼ of the data falls above the third quartile Q3.

Quartiles Cont…

• To identify the quartiles of a given dataset:
• Sort the values in increasing order.
• Identify the quartiles accordingly:
– Q1 is the {0.25(n+1)}th observation
– Q2 is the median, i.e. the {0.5(n+1)}th observation
– Q3 is the {0.75(n+1)}th observation
• NB: if the identified position is not a whole number, then the value should be determined by interpolation between the observations on either side.

Quartiles Cont…

• Example: Let’s assume the following dataset presents the ages of 8 factory workers. Identify the first and the third quartiles.
{18, 21, 23, 24, 24, 32, 42, 59}
• First make sure that the data are sorted in increasing order.
• Q1 is the {0.25(n+1)}th observation
= {0.25(8+1)}th observation
= {0.25(9)}th observation
= {2.25}th observation

Quartiles Cont…

• i.e. Q1 lies a quarter of the distance between the 2nd and 3rd observations (21 and 23). This can be interpolated as:
21 + (23 − 21) × 0.25 = 21.5
• The interpretation is that one fourth of the observations are below or equal to 21.5.
• Q3 is the {0.75(n+1)}th observation = {6.75}th observation
32 + (42 − 32) × 0.75 = 39.5
• The interpretation is that three fourths of the observations are below or equal to 39.5.
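The (n+1) position rule with interpolation can be written as a small Python function. The name `fractile` and its handling of positions outside the data (clamping to the smallest or largest value) are assumptions made for this sketch.

```python
def fractile(values, p):
    """Value at 1-based position p * (n + 1) in the sorted data.

    Non-integer positions are resolved by linear interpolation between the
    two neighbouring observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1)
    lower = int(position)          # whole part of the position
    fraction = position - lower    # distance towards the next observation
    if lower < 1:
        return ordered[0]          # assumption: clamp below the first value
    if lower >= n:
        return ordered[-1]         # assumption: clamp above the last value
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


ages = [18, 21, 23, 24, 24, 32, 42, 59]
print(fractile(ages, 0.25))  # 21.5
print(fractile(ages, 0.75))  # 39.5
```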

Quartiles Cont…

Additional uses of the quartiles:
• The interquartile range (Q3 − Q1) can be used as a measure of dispersion (like the Range). The interquartile range overcomes one of the limitations of the range, i.e. being affected by extreme values.
• The quartile deviation [(Q3 − Q1)/2] and the coefficient of quartile deviation [(Q3 − Q1)/(Q3 + Q1)] are also used, though rarely, as measures of dispersion.
• A dataset can be summarized using the so-called “five-number summary” (this is sometimes represented graphically as a box-and-whisker plot). The five numbers are: the minimum, the first quartile, the median, the third quartile and the maximum.

Deciles

• Deciles partition the data into 10 equal parts.
• They are not as commonly used as percentiles and quartiles.
• There are 9 deciles dividing the data into 10 parts.
• The deciles are termed D1 through D9.
• The interpretation of deciles is as follows:
– About one tenth of the data falls on or below D1.
– About two tenths of the data falls on or below D2.
– The same meaning applies to the other deciles.
• Note that D5 has the same meaning as the median, i.e. the second quartile.

Deciles Cont…

A given decile is determined in the following manner:
1. Arrange the data in ascending order.
2. Compute the decile using the formula:

$$ k^{th}\ \text{decile} = \left[ \frac{k}{10}(n+1) \right]^{th} \text{observation} $$

3. NB: if the identified position is not a whole number, then the value should be determined by interpolation between the observations on either side.

Percentiles

• Percentiles are also like quartiles, but divide the dataset into 100 equal parts.
• Each group represents 1% of the dataset.
• There are 99 percentiles, termed P1 through P99.
• P50 is yet another term for the median.
• Other equivalents, such as P25 = Q1, P75 = Q3, P10 = D1, etc., should also be obvious.
• The interpretation of percentiles is as follows:
– About 1% of the data falls on or below P1.
– About 2% of the data falls on or below P2.
– The same applies to the other percentiles.

Percentiles Cont…

A given percentile is determined in the following manner:
1. Arrange the data in ascending order.
2. Compute the percentile using the formula:

$$ k^{th}\ \text{percentile} = \left[ \frac{k}{100}(n+1) \right]^{th} \text{observation} $$

3. If the identified position is not a whole number, then the value should be determined by interpolation between the observations on either side.

Example

• The following data represent the Biostatistics results (out of 100 marks) of 18 students. Calculate the 4th decile and the 70th percentile.
{72, 51, 59, 80, 84, 71, 82, 71, 51, 48, 66, 81, 78, 69, 75, 67, 76, 75}

• Computing the 4th decile
• Before starting the computation, arrange the observations in increasing order, i.e.
{48, 51, 51, 59, 66, 67, 69, 71, 71, 72, 75, 75, 76, 78, 80, 81, 82, 84}

Example Cont…

• Compute the 4th decile using the formula:

$$ 4^{th}\ \text{decile} = \left[ \frac{4}{10}(n+1) \right]^{th} \text{observation} = [(0.4)(19)]^{th}\ \text{observation} = [7.6]^{th}\ \text{observation} $$

• The 4th decile is between the 7th and 8th observations (i.e. between 69 and 71).
• In order to get the exact value we have to interpolate:
69 + (71 − 69) × 0.6 = 70.2
• About four tenths of the data falls on or below 70.2.

Example Cont…

• Compute the 70th percentile.
• The data are already sorted.
• Compute the 70th percentile using the formula:

$$ 70^{th}\ \text{percentile} = \left[ \frac{70}{100}(n+1) \right]^{th} \text{observation} = [(0.7)(19)]^{th}\ \text{observation} = [13.3]^{th}\ \text{observation} $$

• The 70th percentile is between the 13th and 14th observations (i.e. between 76 and 78).
• In order to get the exact value we have to interpolate:
76 + (78 − 76) × 0.3 = 76.6
• About 70% of the data falls on or below 76.6.
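Python's standard library offers `statistics.quantiles`, whose `method="exclusive"` option places the cut points at position i(n+1)/k with linear interpolation, which matches the rule used in this example. The sketch below (Python 3.8 or later) applies it to the 18 exam scores.

```python
from statistics import quantiles

scores = [72, 51, 59, 80, 84, 71, 82, 71, 51, 48, 66, 81, 78, 69, 75, 67, 76, 75]

# n=10 returns the 9 deciles, n=100 the 99 percentiles (input need not be sorted).
deciles = quantiles(scores, n=10, method="exclusive")
percentiles = quantiles(scores, n=100, method="exclusive")

d4 = deciles[3]         # 4th decile (index 3 because lists are zero-based)
p70 = percentiles[69]   # 70th percentile

print(round(d4, 1), round(p70, 1))  # 70.2 76.6
```

Other software may use different interpolation conventions, so small differences from these hand calculations are possible.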

Rate, Ratio and Proportion

• In addition to measures of central tendency, measures of dispersion and measures of position, a dataset can be summarized mathematically by the use of Rate, Ratio and Proportion.

Rate

• In mathematics, a rate is a fraction in which the numerator measures one variable and the denominator another.
• Usually the denominator of a rate is a measure of time.
• In epidemiology we use rates to measure the occurrence of events over time.
• If the time element is directly reflected in the denominator, it is called a real rate (Example: Incidence density).
• If the fraction measures the number of events per population at risk in a given period of time, it is called an operational rate (Example: Incidence proportion).

Ratio

• Mathematically, a ratio is the comparison of two quantities that have the same units (usually classes of a variable).
• A ratio can be written in three different ways:
– As two numbers separated by a colon (a:b)
– As a fraction (a/b)
– As two numbers separated by the word to (a to b)
• In epidemiology, a ratio relates two quantities (as numerator and denominator) where one is not included in the other.

Proportion

• A proportion is usually presented as a fraction, decimal or percentage.
• Unlike a ratio, the numerator is a subset of the denominator; hence the value indicates the overall contribution of the numerator to the denominator.

Numeric Summarization Using SPSS

• In SPSS, numeric summaries are available through many alternatives. Commonly used are:
– Analyze > Descriptive Statistics > Frequencies > Statistics.
– Analyze > Descriptive Statistics > Descriptives > Statistics.
– Analyze > Descriptive Statistics > Crosstabs > Statistics.
– Analyze > Descriptive Statistics > Explore > Statistics.
– Analyze > Reports > OLAP Cubes > Statistics.

Basic Probability

What is Probability

• Probability is the chance that an event will occur when the trial is repeated a nearly infinite number of times under the same conditions. OR
• The probability of an event is the relative frequency of a set of outcomes over an indefinitely large (or infinite) number of trials.
• A sample space is the set of all possible outcomes of a trial or experiment.
• An event is a subset of the sample space.
• An event can be simple or composite. A composite event contains more than one simple event.

Concept of Union, Intersection and Complement

Mutually Exclusive Events and The Additive Law

• Events are said to be mutually exclusive if they have no outcome in common.
• Examples:
• The Additive Law, when applied to two mutually exclusive events, states that the probability of either of the two events occurring is obtained by adding the probabilities of the two events.
• p(A or B) = p(A) + p(B)

Mutually Exclusive Cont..

Example 4.1:
• Roll a six-sided die. The sample space consists of six possible outcomes (1, 2, 3, 4, 5, 6). Each outcome has an equal probability of occurrence (i.e. 1/6). The probability of rolling an even number would be:
• p(even) = p(2) + p(4) + p(6)
• = (1/6) + (1/6) + (1/6) = 1/2

Mutually Exclusive Cont..

Example 4.2:

• The natural history of Tuberculosis indicates that, among TB patients without any treatment, at the end of the 5th year of illness ½ would die, ¼ would develop permanent disability and ¼ would recover. What is the probability that an untreated TB patient either recovers or develops permanent disability (in other words, avoids death) after 5 years of illness?
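Both additive-law examples can be worked through with exact fractions. The sketch below treats the three TB outcomes as mutually exclusive, as the example implies; the dictionary names are illustrative.

```python
from fractions import Fraction

# Example 4.1: a fair six-sided die, each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
p_even = sum(p for face, p in die.items() if face % 2 == 0)

# Example 4.2: outcomes of untreated TB after 5 years (mutually exclusive).
tb = {
    "death": Fraction(1, 2),
    "disability": Fraction(1, 4),
    "recovery": Fraction(1, 4),
}
p_avoid_death = tb["disability"] + tb["recovery"]

print(p_even, p_avoid_death)  # 1/2 1/2
```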

Conditional Probability and the Multiplicative Law

• Conditional probability is defined as the probability that a certain event will occur given that another event has occurred.
• p(A|B) or "probability of A given B":

$$ p(A|B) = \frac{p(A \cap B)}{p(B)} $$

• This formula is conveniently rewritten as the following, which is commonly referred to as the Multiplicative Rule:

$$ p(A \cap B) = p(A|B) \times p(B) $$

Conditional Probability Cont..

Example 4.3:
• What is the probability that the outcome of a roll of a die is 2 (A2) given that the outcome is even?

Example 4.4:
• A medical practitioner measured the CD4 count of AIDS patients on ART twice within a month. About 25% of the patients had a normal value in both tests and 42% of them had a normal result in the first test. What percent of those who had a normal value in the first test also had a normal value in the second test?
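A possible way to work Examples 4.3 and 4.4 is to apply p(A|B) = p(A and B)/p(B) directly, as in this sketch. The helper name `conditional` is illustrative.

```python
from fractions import Fraction


def conditional(p_a_and_b, p_b):
    """Return p(A|B) = p(A and B) / p(B)."""
    if p_b == 0:
        raise ZeroDivisionError("p(B) must be positive")
    return p_a_and_b / p_b


# Example 4.3: A = "roll a 2", B = "roll an even number".
# "2 and even" is just "2", so p(A and B) = 1/6 and p(B) = 1/2.
p_two_given_even = conditional(Fraction(1, 6), Fraction(1, 2))

# Example 4.4: p(normal in both tests) = 0.25, p(normal in first test) = 0.42.
p_second_given_first = conditional(0.25, 0.42)

print(p_two_given_even)                 # 1/3
print(round(p_second_given_first, 3))   # 0.595
```

So about 59.5% of the patients with a normal first result also had a normal second result.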

Independent Events and the Multiplicative Law

• For two given events, if the occurrence or nonoccurrence of one doesn’t affect in any way the occurrence or nonoccurrence of the other, the events are called independent events.
• With independent events the multiplicative law becomes:
p(A and B) = p(A)p(B)

Independent Events Cont..

Example 4.5:
• Assume we have rolled a die twice. What is the probability of getting 6 in both rolls?

Example 4.6:
• The probability of getting a normal birth weight baby at 33 weeks of gestational age is 1/5. If two pregnant women at this gestational age gave birth in Bethel Hospital yesterday, what is the probability that both babies have normal birth weight?
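Examples 4.5 and 4.6 follow from p(A and B) = p(A)p(B). The sketch below also enumerates the 36 equally likely outcomes of two rolls as a cross-check, and it assumes the two births are independent, as the example implies.

```python
from fractions import Fraction
from itertools import product

# Example 4.5: two independent rolls of a fair die.
p_six = Fraction(1, 6)
p_double_six = p_six * p_six

# Cross-check by listing all (first roll, second roll) pairs.
pairs = list(product(range(1, 7), repeat=2))
p_enumerated = Fraction(sum(1 for a, b in pairs if a == b == 6), len(pairs))

# Example 4.6: two independent births, each normal weight with probability 1/5.
p_both_normal = Fraction(1, 5) * Fraction(1, 5)

print(p_double_six, p_enumerated, p_both_normal)  # 1/36 1/36 1/25
```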

Bayes' Theorem

• Bayes' theorem was published in the eighteenth century by Thomas Bayes.
• It says that you can use conditional probability to make predictions in reverse.
• It is sometimes called the inverse probability law:
• P(B|A) = P(A and B)/P(A) ………………………………1
• P(A|B) = P(A and B)/P(B) ………………………………2
• Solving [1] for P(A and B) and substituting into [2] gives Bayes' Theorem:
P(A|B) = [P(B|A)][P(A)]/P(B)
• The general formula for Bayes' Theorem, for mutually exclusive and exhaustive events $A_1, \dots, A_k$, is:

$$ P(A_i|B) = \frac{P(B|A_i)\,P(A_i)}{\sum_{j=1}^{k} P(B|A_j)\,P(A_j)} $$

Bayes' Theorem Cont…

Example 4.7:
• Suppose there is a certain disease randomly found in 0.5% (.005) of the general population. A certain clinical blood test is 99% effective in detecting the presence of the disease among persons with the disease. But it also yields false-positive results in 5% of individuals without the disease. The following tables show the probabilities that are stipulated in the example and the probabilities that can be inferred from the stipulated information:
• (Source: http://faculty.vassar.edu/lowry/bayes.html)


Given:

P(A) = .005 | The probability that the disease will be present in any particular person
P(~A) = 1 - .005 = .995 | The probability that the disease will not be present in any particular person
P(B|A) = .99 | The probability that the test will yield a positive result [B] if the disease is present [A]
P(~B|A) = 1 - .99 = .01 | The probability that the test will yield a negative result [~B] if the disease is present [A]
P(B|~A) = .05 | The probability that the test will yield a positive result [B] if the disease is not present [~A]
P(~B|~A) = 1 - .05 = .95 | The probability that the test will yield a negative result [~B] if the disease is not present [~A]


• Given this information, the derivation of two simple probabilities is possible using the conditional probability formula.

P(B) = [P(B|A) x P(A)] + [P(B|~A) x P(~A)] = [.99 x .005] + [.05 x .995] = .0547 | The probability of a positive test result [B], irrespective of whether the disease is present [A] or not present [~A]
P(~B) = [P(~B|A) x P(A)] + [P(~B|~A) x P(~A)] = [.01 x .005] + [.95 x .995] = .9453 | The probability of a negative test result [~B], irrespective of whether the disease is present [A] or not present [~A]


• Then it is possible to calculate the remaining probabilities.

P(A|B) = [P(B|A) x P(A)] / P(B) = [.99 x .005] / .0547 = .0905 | The probability that the disease is present [A] if the test result is positive [B]
P(~A|B) = [P(B|~A) x P(~A)] / P(B) = [.05 x .995] / .0547 = .9095 | The probability that the disease is not present [~A] if the test result is positive [B]
P(~A|~B) = [P(~B|~A) x P(~A)] / P(~B) = [.95 x .995] / .9453 = .99995 | The probability that the disease is absent [~A] if the test result is negative [~B]
P(A|~B) = [P(~B|A) x P(A)] / P(~B) = [.01 x .005] / .9453 = .00005 | The probability that the disease is present [A] if the test result is negative [~B]
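The Bayes' theorem calculations of Example 4.7 can be reproduced in a few lines of Python. The variable names are illustrative; the inputs are the three stipulated probabilities (prevalence .005, sensitivity .99, false-positive rate .05).

```python
p_a = 0.005              # P(A): disease present
p_b_given_a = 0.99       # P(B|A): positive test if disease present
p_b_given_not_a = 0.05   # P(B|~A): positive test if disease absent

p_not_a = 1 - p_a
p_not_b_given_a = 1 - p_b_given_a
p_not_b_given_not_a = 1 - p_b_given_not_a

# Total probability of a positive and of a negative test result.
p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
p_not_b = p_not_b_given_a * p_a + p_not_b_given_not_a * p_not_a

# Bayes' theorem: reverse the conditioning.
p_a_given_b = p_b_given_a * p_a / p_b
p_not_a_given_not_b = p_not_b_given_not_a * p_not_a / p_not_b

print(round(p_b, 4), round(p_not_b, 4))   # 0.0547 0.9453
print(round(p_a_given_b, 4))              # 0.0905
print(round(p_not_a_given_not_b, 5))      # 0.99995
```

Even with a 99% sensitive test, only about 9% of positive results correspond to true disease, because the disease is rare.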

Summary of the Basic Properties of Probability

1. The value of a probability always lies in the range 0 ≤ p ≤ 1.
2. If an event is certain to occur, its probability is 1, and if an event is certain not to occur, its probability is 0.
3. If two events are mutually exclusive (disjoint), the probability that one or the other will occur equals the sum of the probabilities: p(A or B) = p(A) + p(B)
4. If A and B are two events, not necessarily disjoint, then p(A or B) = p(A) + p(B) − p(A and B)
5. The sum of the probabilities that an event will occur and that it will not occur is equal to 1.
6. If A and B are two independent events, then p(A and B) = p(A)p(B)
7. p(A|B) = p(A ∩ B)/p(B)

Random Variable and Probability Distribution

Random Variable

• Any characteristic that can be measured or categorized is called a Variable.
• If a variable can assume a number of different values such that any particular outcome is determined by chance, it is called a Random Variable.
• A Random Variable is a function which assigns a unique numerical value to each possible outcome of a random experiment under fixed conditions.

Random Variable Cont…

Example 4.8
• Three students are taken at random from this classroom. Suppose our interest is the number of female students among the three sampled. The possible outcomes, with the number of females in each, are:

Outcome | No of Females
MMM | 0
MMF | 1
MFM | 1
FMM | 1
MFF | 2
FMF | 2
FFM | 2
FFF | 3
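The outcome table of Example 4.8 can be generated by enumeration. In the sketch below, the last step turns the counts into probabilities under the extra assumption (not stated in the example) that each of the eight outcomes is equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All ordered outcomes for three sampled students (M = male, F = female).
outcomes = ["".join(seq) for seq in product("MF", repeat=3)]

# The random variable X maps each outcome to its number of females.
x = {outcome: outcome.count("F") for outcome in outcomes}
counts = Counter(x.values())
print(sorted(counts.items()))  # [(0, 1), (1, 3), (2, 3), (3, 1)]

# Assumption: all 8 outcomes are equally likely (P(F) = 1/2, independent draws).
distribution = {k: Fraction(c, len(outcomes)) for k, c in sorted(counts.items())}
```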

Random Variable Cont…

• There are two types of random variables.
– A Continuous Random Variable is one that can take any value within an interval, i.e. an uncountably infinite number of possible values; and,
– A Discrete Random Variable is one that takes a finite (or countable) number of distinct values.
• Example 4.9:
– A coin is tossed 10 times. The random variable X is the number of tails that are noted. X can only take the values 0, 1, ..., 10, so X is a Discrete Random Variable.
– A light bulb is burned until it burns out. The random variable Y is its lifetime in hours. Y can take any positive real value, so Y is a Continuous Random Variable.

Probability Distributions

• Every Random Variable has a corresponding Probability Distribution.
• A Probability Distribution applies the theory of probability to describe the behavior of the random variable.
• In the discrete case, it specifies all possible outcomes of the random variable along with the probability that each will occur.
• In the continuous case, it allows us to determine the probabilities associated with specified ranges of values.

Discrete Probability Distribution

• Usually represented by a table.

Example 4.10:
• Table 4.1: Probability distribution of a random variable X representing the birth order of children born in the US.

x | P(X=x)
1 | 0.416
2 | 0.330
3 | 0.158
4 | 0.058
5 | 0.021
6 | 0.009
7 | 0.004
8+ | 0.004
Total | 1.000

Continuous Probability Distributions

• Since a continuous random variable assumes an infinite number of outcomes, it cannot be expressed in tabular form. Instead, an equation or graph describes it.
• The equation used to describe a continuous probability distribution is called a Probability Density Function (PDF).
• A PDF has the following properties:

Continuous Probability Distributions Cont..

• The area bounded by the curve of the density function and the x-axis is equal to 1, when computed over the domain of the variable.
• The probability that a random variable assumes a value between a and b is equal to the area under the density function bounded by a and b.
• The probability that a continuous random variable will equal a specific value is always zero.

Binomial Distribution

• A discrete probability distribution.
• It handles a dichotomous (binary, Bernoulli) random variable, i.e. a variable which has only two outcomes (success and failure).
• Each trial is called a Bernoulli trial.
• A binomial experiment has the following properties:
– The experiment consists of n repeated trials.
– Each trial can result in just two possible outcomes.
– The probability of success, denoted by P, is the same on every trial.
– The trials are independent.

Binomial Distribution Cont..

• b(x; n, P): The probability that an n-trial binomial experiment results in exactly x successes, when the probability of success on an individual trial is P.

$$ b(x; n, P) = {}_{n}C_{x} \, P^{x} (1 - P)^{n - x} $$


Example 4.11:
• Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?
• Suppose in Addis Ababa the probability that a commercial sex worker is HIV positive is 0.15. If we consider 5 randomly selected commercial sex workers in the city, what is the probability that exactly 2 of them will be HIV positive?
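The two questions of Example 4.11 can be answered with the binomial formula b(x; n, P). The sketch below uses `math.comb` (Python 3.8 or later); `binomial_pmf` is an illustrative helper name.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# Exactly 2 fours in 5 tosses of a fair die (P = 1/6).
print(round(binomial_pmf(2, 5, 1 / 6), 4))   # 0.1608

# Exactly 2 HIV-positive results among 5 people when P = 0.15.
print(round(binomial_pmf(2, 5, 0.15), 4))    # 0.1382
```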


Cumulative Binomial Probability:
• Refers to the probability that the binomial random variable falls within a specified range (e.g., is greater than or equal to a stated lower limit and less than or equal to a stated upper limit).

Binomial Distribution Cont…

Example 4.12:
• The probability that a student is accepted to a prestigious college is 0.3. If 5 students from the same school apply, what is the probability that at most 2 are accepted?
• What is the probability of getting 4 or more HIV positives among 5 randomly selected sex workers, given that the probability that a commercial sex worker is HIV positive is 0.15?
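A cumulative binomial probability is the sum of the individual binomial probabilities over the stated range. The sketch below applies this to both parts of Example 4.12; the helper names are illustrative.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binomial_range(low, high, n, p):
    """Probability that the number of successes lies in [low, high]."""
    return sum(binomial_pmf(x, n, p) for x in range(low, high + 1))


# At most 2 of 5 applicants accepted, P = 0.3: b(0) + b(1) + b(2).
print(round(binomial_range(0, 2, 5, 0.3), 4))    # 0.8369

# 4 or more HIV positives among 5, P = 0.15: b(4) + b(5).
print(round(binomial_range(4, 5, 5, 0.15), 4))   # 0.0022
```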

Poisson Distribution

• A discrete probability distribution.
• First introduced by Siméon-Denis Poisson (1781–1840).
• It expresses the probability of a number of random events occurring in a fixed period of time if these events occur with a known average rate.
• A Poisson experiment is a statistical experiment that has the following properties:

Poisson Distribution Cont…

– The experiment results in outcomes that can be classified as successes or failures.
– The average number of successes (λ) that occurs in a specified period is known.
– The probability that a success will occur is proportional to the duration of the period.
– The probability that a success will occur in an extremely small period is virtually zero.
• Note that the distribution can also be used to quantify the probability of occurrence of an event in a length, an area, a volume, etc.


• The following notations are important:
– e: A constant equal to approximately 2.71828.
– λ: The mean number of successes (occurrences of an event) that occur in a specified period of time.
– x: The actual number of successes that occur in a specified period of time.
– P(x; λ): The Poisson probability that exactly x successes occur in a Poisson experiment, when the mean number of successes is λ.


• Given the mean number of successes (λ) that occur in a specified period of time, we can compute the Poisson probability based on the following formula:

$$ P(x; \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!} $$

Example 4.13:
• Let’s assume the average number of deaths among breast cancer cases is 2 per day. What is the probability that exactly 3 patients will die tomorrow?
• λ = 2, since 2 patients die per day on average.
• x = 3, i.e. we want the probability that 3 will die tomorrow.
• e = 2.71828


• We put these values into the formula as follows:

$$P(3; 2) = \frac{(2.71828^{-2})(2^{3})}{3!} = \frac{(0.13534)(8)}{6} = 0.180$$

• Thus, the probability of exactly 3 deaths tomorrow is 0.180.

167


Example 4.14:
• In a study of suicides, a researcher found that the monthly distribution of adolescent suicides in the US follows a Poisson distribution with parameter λ = 2.75. Find the probability that a randomly selected month will be one in which three adolescent suicides occur.
• $P(x; \lambda) = e^{-\lambda} \lambda^{x} / x!$
• $P(3; 2.75) = e^{-2.75} (2.75^{3}) / 3!$
• $P(3; 2.75) = 0.222$
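The Poisson formula can be checked numerically. The following is a minimal sketch that reproduces Examples 4.13 and 4.14; the function name `poisson_pmf` and the parameter name `lam` (for the mean number of successes) are illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of exactly x events when the mean number of events is lam."""
    return exp(-lam) * lam**x / factorial(x)

p_example_413 = poisson_pmf(3, 2)      # 3 deaths when the mean is 2 per day
p_example_414 = poisson_pmf(3, 2.75)   # 3 suicides when the mean is 2.75 per month

print(round(p_example_413, 3), round(p_example_414, 3))
```

Both results agree with the hand calculations to three decimal places (0.180 and 0.222).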

168



• If the number of admissions in a hospital is 10 per hour on average, determine the probability that, in any hour, there will be:
– 0 admissions;
– 6 admissions;
– Less than 2 admissions.
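One way to approach the admissions exercise is sketched below, assuming admissions per hour follow a Poisson distribution with mean 10. "Less than 2" is treated as the cumulative probability of 0 or 1 admissions; the helper names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of exactly x events when the mean is lam."""
    return exp(-lam) * lam**x / factorial(x)

def poisson_cdf(x, lam):
    """Probability of x or fewer events: sum of the point probabilities 0..x."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

lam = 10  # mean admissions per hour
p_zero = poisson_pmf(0, lam)
p_six = poisson_pmf(6, lam)
p_less_than_two = poisson_cdf(1, lam)  # P(X = 0) + P(X = 1)

print(p_zero, p_six, p_less_than_two)
```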

169

Normal Distribution

• Is the most important probability distribution function.
• It is also known as the Gaussian distribution.
• Named after Carl Friedrich Gauss (1777–1855).
• Given by the formula:

$$Y = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$

• The formula is affected by two main factors: the mean (µ) and the SD (σ).
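To see how the mean and SD enter the normal curve, the density can be evaluated directly. This is a small sketch; `normal_pdf` is an illustrative name, and the glucose values (mean 90, SD 6) are borrowed from a later example in these notes purely for illustration. The result is compared with `statistics.NormalDist` from the Python standard library.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    return (1 / (sigma * sqrt(2 * pi))) * exp(-((x - mu) ** 2) / (2 * sigma**2))

# The curve is highest at the mean and symmetric around it.
peak = normal_pdf(90, mu=90, sigma=6)
left = normal_pdf(84, mu=90, sigma=6)
right = normal_pdf(96, mu=90, sigma=6)

# Cross-check against the standard library implementation.
library_value = NormalDist(mu=90, sigma=6).pdf(96)
print(peak, left, right, library_value)
```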

170

Normal Distribution Cont…

Normal distribution has the following characteristics:
1. Bell shaped
2. Symmetrical about the mean
3. Unimodal
4. Mean, median and mode are equal
5. Area under the curve is 1
6. Extends from negative infinity to positive infinity

• The normal distribution can be used to describe, at least approximately, any variable that tends to cluster around the mean (mainly as a result of the central limit theorem).

Skewness, Kurtosis, and Normal Curve

• Skewness and kurtosis are used to assess normality.
• Significant skewness and kurtosis indicate that data are not normal.
• Skewness is a measure of asymmetry.
• For univariate data Y1, Y2, ..., YN, the formula for skewness is:

• Where Ȳ (Y bar) is the mean, S is the standard deviation, and N is the number of data points.

• The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero.


Skewness, Kurtosis Cont…

• Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution.

• For univariate data Y1, Y2, ..., YN, the formula for kurtosis is:

• The kurtosis for a normal distribution is three.
• For this reason, some use the following definition of kurtosis (often referred to as "excess kurtosis"):

• Positive excess kurtosis indicates a "peaked" distribution and negative excess kurtosis indicates a "flat" distribution.
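The slide formulas for skewness and kurtosis did not survive in this copy, so the sketch below uses one common moment-based definition as an assumption: the third and fourth central moments divided by the appropriate power of the standard deviation, with every average taken over N. Other textbooks and software divide by N − 1 or apply bias corrections, so results may differ slightly. The data values are made up for illustration.

```python
def shape_measures(values):
    """Return (skewness, kurtosis, excess kurtosis) using divide-by-N moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (divide by N)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skewness = m3 / m2**1.5
    kurtosis = m4 / m2**2
    return skewness, kurtosis, kurtosis - 3

symmetric = [1, 2, 3, 4, 5]            # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]     # hypothetical data with a long right tail

print(shape_measures(symmetric))
print(shape_measures(right_skewed))
```

The symmetric data give a skewness of zero, while the data with a long right tail give a positive skewness.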

Normality Test

• Normality tests assess the likelihood that the given data set comes from a normal distribution.

• It is an important aspect of statistics, as many procedures assume normality.

• Typically the null hypothesis H0 is that the observations are distributed normally with unspecified mean µ and variance σ².

• The alternative Ha is that the distribution is arbitrary.
• A great number of tests (over 40) have been devised for this problem; the more prominent of them are outlined below:

174

Normality Test Cont…

• The simplest method of assessing normality is to look at the frequency distribution histogram (symmetry, peakedness of the curve, modality of the distribution).

• The other option is the use of probability plots.
• A probability plot is a graphical technique for comparing two datasets: either two sets of empirical observations, one empirical set against a theoretical set, or two theoretical sets against each other.
– It is a common way of assessing normality, i.e. by comparing the given data against a normal distribution.
– Has two variants: Q-Q plot and P-P plot.

175


• Quantile-Quantile plot (Q-Q plot):
– Compares two probability distributions by plotting their quantiles against each other.
– If the two distributions being compared are similar, the points in the Q-Q plot will approximately lie on the line y = x.

• Probability-Probability plot (P-P plot):
– Compares two probability distributions by plotting their cumulative distribution functions against each other.
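The points of a normal Q-Q plot can be computed without any plotting library. The sketch below pairs each sorted observation with the corresponding quantile of a normal distribution fitted to the data. The plotting positions (i − 0.5)/n are one common convention and are an assumption here, as are the function name and the glucose-like data values.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(data):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    observed = sorted(data)
    n = len(observed)
    fitted = NormalDist(mean(observed), stdev(observed))
    # Plotting position (i - 0.5) / n for the i-th smallest observation
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

data = [82, 85, 88, 90, 91, 93, 97, 104]  # hypothetical measurements
for expected, actual in normal_qq_points(data):
    print(round(expected, 1), actual)
```

If the data are approximately normal, the printed pairs are close to each other, which is the numerical counterpart of points lying near the line y = x.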

176



• It is possible to assess the normality of data objectively using statistical tests (for example, the Kolmogorov-Smirnov test and the Shapiro-Wilk test).

• In SPSS:
• Analyze > Descriptive Statistics > Explore > enter the variable under Dependent List > open Plots and check "Normality plots with tests" > Continue > OK.

• But such tests have serious limitations:
– Small samples almost always pass a normality test,
– With large samples, minor deviations from normality may be flagged as statistically significant.

Normal Distribution Cont…
Application of the normal distribution to calculate probability:
1. Area under the curve is 1,
2. Probability of x > a is the area between a and positive infinity,
3. Probability of x < a is the area between a and negative infinity,
4. Probability of b < x < a is the area between a and b,
5. Probability of x = a is zero,
6. The empirical rule: about 68%, 95% and 99.7% of values lie within 1, 2 and 3 standard deviations of the mean, respectively.
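The empirical rule can be verified by computing areas under the standard normal curve. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `area_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k) * 100, 1))
```

The three areas come out at roughly 68.3%, 95.4% and 99.7%, which the rule rounds to 68%, 95% and 99.7%.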

But how can we compute the area???

178

Standard Normal Distribution

• Is a normal distribution with a mean of 0 and a standard deviation of 1.

• Any point (x) from a normal distribution can be converted to the standard normal distribution (Z) with the formula:

$$Z = \frac{x - \mu}{\sigma}$$

• The corresponding area can be read from a standard normal table.

179

Standard Normal Distribution Cont..

Example 4.15:
• If 1.4 m is the height of a student, where the mean for students of his age and sex is 1.2 m with a standard deviation of 0.4 m:
– What is the corresponding Z value for the student?
– What is the probability of finding a student taller than 1.4 m?
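Example 4.15 can be worked through in two steps: convert the height to a Z value, then take the area to the right of that Z value. This is a sketch only; a standard normal table gives the same area, and the function name `z_score` is illustrative.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z = z_score(1.4, mean=1.2, sd=0.4)
p_taller = 1 - NormalDist().cdf(z)  # area to the right of z

print(round(z, 2), round(p_taller, 4))
```

With these inputs Z is 0.5 and the probability of a height above 1.4 m is about 0.31.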

180



Cont..
Example 4.16:
• Assume blood glucose level among medical students is normally distributed with a mean of 90 mg/dl and an SD of 6 mg/dl. Student X has a mean glucose level of 100 mg/dl. Another student Y has a mean glucose level of 80 mg/dl.
– What is the Z score for student X?
– What is the Z score for student Y?
– What is the probability of getting a mean glucose level less than 100 mg/dl?
– What is the probability of getting a mean glucose level less than 80 mg/dl?

181


Cont..
– What range around the mean encompasses 68% of the observations?

– What is the probability for a student to have a blood glucose level between 100 and 105 mg/dl?

182


Example 4.17:
• Among pregnant women having ANC follow-up in a hospital, WBC count follows a normal distribution with a mean of 8,000 and a standard deviation of 800.
– What is the probability of a WBC count of more than 10,000 in those pregnant women?
– What is the probability of a WBC count between 7,500 and 10,000?

183


1. Suppose in BL Hospital the probability that a donated blood unit is positive for Hepatitis B is 0.2. If we consider 4 randomly selected donated blood units, what is the probability that exactly 2 of the samples will be positive for Hepatitis B?

2. Suppose that systolic blood pressures follow a normal distribution with a mean of 108 and an SD of 14. Based on this information, attempt the following questions.
– About 95% of the blood pressures are between ____ & ____.
– About ______% of the blood pressures are between 66 & 150.
– What is the probability that a patient’s BP is > 120?
– What is the probability that the patient’s BP is between 110 & 130?
– What is the probability that a patient’s BP is < 108?

184


Introduction to Demographic Methods and Health Service Statistics

What is Demography?

• “Demos” + “graphy”
• Is a discipline that studies human populations with respect to size, composition, distribution and mobility, the variation in these features, the causes of such variations, and their effect on health, environmental, social, ethical and economic conditions.

• Demography as a “method” and as “data”.
• Demography studies a population in its “static” and “dynamic” aspects.
• Static aspects include characteristics at a point in time, such as composition by age, sex, race, marital status, etc.
• Dynamic aspects are fertility, mortality, nuptiality, migration and growth.

Source of Demographic Data

• Demographic data can be acquired through three methods:
– Census
– Survey
– Vital Registration

187

Census

• Refers to the total process of collecting, compiling, analyzing, and publishing or otherwise disseminating demographic, economic, and social data pertaining to all persons in a country or in a well-delineated part of a country at a specified time.

• A census has the following characteristics:
– Universality
– Simultaneity
– Individual enumeration
– Regular interval

188


Census Cont..

• The first real census was conducted in the UK in 1841.
• However, there is evidence of large-scale counting of populations going back to the prehistoric period.

• Content of Census
– Demographic data
– Economic data
– Social data
– Mortality and Birth

189

Approaches to Census

De jure:
• The enumeration is according to the legal or customary place of residence,
• i.e. people are registered where they usually reside.
• This type of counting gives information relatively unaffected by seasonal and temporary movements.
• However, it might not be accurate when a person’s legal or customary residence is not known.
• It also creates a risk of omission and double counting.
• Information collected about a person who is away from his/her usual residence can also be incomplete.

Approaches Cont…

De facto:
• The enumeration is according to physical residence at the time of the census,
• i.e. people are registered where they are staying at the time of the census.
• This method has the advantage of a lower chance of double counting or omission.
• However, if it is applied in areas with a high level of migration and mobility, the result can be distorted.

191

Advantage and Disadvantage of Census

• Advantage
– It represents the whole population,
– Serves as a sampling frame for further studies,
– Provides population denominators,
– Provides small area data.

• Disadvantage
– Size limits content and quality control efforts,
– Cost limits frequency,
– Delay between field work and results,
– Sometimes politicized.

192


Vital Registration (Civil Registration)

• Vital registration is the continuous registration of vital events as they happen.

• What are the vital events?
• Vital registration is a relatively modern concept in its present format.
• The major purpose of vital registration is primarily administrative.
• Vital registration has the following features:
– Continuity
– Universality

193

Advantages of Vital Registration

• Continuously monitors vital rates,
• May provide both numerator and denominator for some rates,
• Small area data available,
• Can be used as a base for testing the accuracy of censuses and surveys,
• Once a system is established, it is cost effective.

194

Disadvantages of Vital Registration

• Uncertain coverage,
• It is difficult to establish the system,
• Information may come from a third party,
• It can easily be disrupted by political/economic events.

195

Survey

• Refers to the process of obtaining information from a sample representative of some population at a given point in time.

• How can we make it representative?
• Surveys can be of two types:
– Single-round retrospective survey
– Multi-round follow-up survey

• The content of surveys varies widely.
• Features of a survey:
– Representativeness,
– Smaller size,
– More in-depth information.

196


Advantage and Disadvantage of Survey

• Advantages:
– Quick and inexpensive,
– Gives detailed data,
– Follow up can be achieved.

• Limitations:
– Small area data might not be available,
– Perfect representativeness is difficult to achieve,
– A survey can only focus on a few thematic areas.

197

Demographic Transition

• Conceptual framework to explain population change over time.

• Developed by the American demographer Warren Thompson in 1929.

• Based on observed changes in birth and death rates in industrialized societies over the past two hundred years.

• Demographic change has three stages.
• Developed countries started the second stage at the beginning of the eighteenth century. Less developed countries began the transition later.

198

Demographic Transition Cont…

199

Demographic Transition Cont…

• Stage I: Characterized by high and fluctuating mortality, high fertility and low population growth.

• Stage II: Characterized by the beginning of mortality decline, followed by fertility decline. This is the period of rapid population growth.

• Stage III: Characterized by low mortality and low, fluctuating fertility; growth slows down and eventually reaches a no-growth stage.

200


Important Indicators of Composition of a Population

1. Sex Ratio: The total number of males per 1000 females. It can be expressed as Y to 1000, as Y:1, or as Y/X, where Y is the number of males and X is the number of females.

2. Child to Women Ratio: The ratio of the number of children under five to the number of women of reproductive age in a given place and time. It can also be used as a measure of fertility.

3. Dependency Ratio: Describes the ratio between the non-productive (age 0-14 and 65+) and productive (15-64) age groups in a given place and time.

4. Population Pyramid:

Population Pyramid

• A graphical illustration that shows the distribution of various age groups in a population.

• Normally forms the shape of a pyramid.
• Consists of two back-to-back bar graphs, with the population plotted on the X-axis and age on the Y-axis,
• One showing the number of males and one showing the number of females in a particular population, in five-year age groups.

• Males are shown on the left and females on the right.
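The ratio indicators of population composition (sex ratio, child to women ratio and dependency ratio) are simple quotients of age and sex counts, the same counts a population pyramid displays. The sketch below uses entirely hypothetical counts. Expressing the dependency ratio per 100 persons of productive age is a common convention and is an assumption here, since the definition above only states the ratio.

```python
def sex_ratio(males, females):
    """Males per 1000 females."""
    return males / females * 1000

def child_woman_ratio(children_under_5, women_15_49):
    """Children under five per woman of reproductive age."""
    return children_under_5 / women_15_49

def dependency_ratio(age_0_14, age_65_plus, age_15_64):
    """Non-productive persons per 100 persons of productive age (assumed scaling)."""
    return (age_0_14 + age_65_plus) / age_15_64 * 100

# Hypothetical population counts
print(sex_ratio(males=49_000, females=50_000))
print(child_woman_ratio(children_under_5=15_000, women_15_49=24_000))
print(dependency_ratio(age_0_14=42_000, age_65_plus=3_000, age_15_64=54_000))
```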

202

Population Pyramid

203

Population Pyramid

204


Vital Statistics

• Among the areas that demography focuses on, some are more important and applicable in public health.

• In particular, the measures of mortality and fertility are vital inputs to the health system, so they are called Vital Statistics.

205

Measures of Fertility

• Crude Birth Rate (CBR): The number of live births in a year per 1000 mid-year population in the same year.

$$\text{CBR} = \frac{\text{Total number of live births in a year}}{\text{Mid-year population in the same year}} \times 1000$$

206

Measures of Fertility Cont..

• General Fertility Rate (GFR): The number of live births in a year per 1000 mid-year women of reproductive age.

$$\text{GFR} = \frac{\text{Total number of live births in a year}}{\text{Mid-year female population aged 15-49 yrs in the same year}} \times 1000$$

207


• Age Specific Fertility Rate (ASFR): Refers to the number of live births in a year per 1000 women of reproductive age in a given age or age group.

• Usually ASFR is calculated for the following 7 five-year age groups: 15-19, 20-24, 25-29, 30-34, 35-39, 40-44 and 45-49 yrs.

$$\text{ASFR} = \frac{\text{Total no. of live births to women of a given age group during a year}}{\text{Mid-year female population for the same age group in the same year}} \times 1000$$

208



Age category    ASFR
15-19           104
20-24           228
25-29           241
30-34           231
35-39           160
40-44           84
45-49           34


• Total Fertility Rate (TFR): The number of children a woman is expected to have by the end of her reproductive age if the current ASFRs are maintained.

• Mathematically, it is the sum of all ASFRs from 15-49 yrs.

• For data given in the usual 5-year age categories, TFR is:

$$\text{TFR} = 5 \times \sum_{i=1}^{7} \text{ASFR}_i$$

210


• Gross Reproduction Rate (GRR): Is the total fertility rate restricted to female births only.

$$\text{GRR} = \text{TFR} \times \text{Proportion of female births} \times 1000$$

211


• Children Ever Born (CEB):
• Total number of children a woman has ever given birth to.
• It is summarized as the average number of children per woman in a given study area.

212



Example 5.1:
• Calculate ASFR, TFR, GFR, CBR from the following data.

213

Measures of Fertility Cont..

Age category    Women of reproductive age    Live births    ASFR
15-19           15,600                       1,596
20-24           14,400                       3,300
25-29           13,300                       3,210
30-34           12,200                       2,830
35-39           11,600                       1,860
40-44           10,100                       850
45-49           9,200                        320
Total           86,400                       13,966
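The fertility measures for Example 5.1 can be sketched as below, using the counts from the table. Two points are assumptions made for this sketch: TFR is also reported per woman by dividing the per-1000 figure by 1000, and CBR is left out because the table gives no total mid-year population.

```python
age_groups = ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49"]
women = [15_600, 14_400, 13_300, 12_200, 11_600, 10_100, 9_200]
births = [1_596, 3_300, 3_210, 2_830, 1_860, 850, 320]

# ASFR: live births per 1000 women in each age group
asfr = [b / w * 1000 for b, w in zip(births, women)]

# GFR: all live births per 1000 women aged 15-49
gfr = sum(births) / sum(women) * 1000

# TFR: 5 x sum of ASFRs (each age group spans 5 years)
tfr_per_1000_women = 5 * sum(asfr)
tfr_per_woman = tfr_per_1000_women / 1000

for group, rate in zip(age_groups, asfr):
    print(group, round(rate, 1))
print("GFR:", round(gfr, 1))
print("TFR per woman:", round(tfr_per_woman, 2))
```

With these data the GFR is about 162 per 1000 women and the TFR is about 5.4 children per woman.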

214

Measures of Mortality

• Crude Death Rate (CDR): Refers to the total number of deaths in a given area, usually in a year, per 1000 mid-year population.

$$\text{CDR} = \frac{\text{Total number of deaths per year}}{\text{Mid-year population}} \times 1000$$

215


• Age Specific Death Rate (ASDR): Quantifies deaths occurring in a defined age category in a given area per 1000 mid-year population of the same age category.

$$\text{ASDR} = \frac{\text{No. of deaths in a given age category in a year}}{\text{Mid-year population of that age category in the same year}} \times 1000$$

216


Measures of Mortality
• Neonatal Mortality Rate (NMR): The number of deaths before the age of 28 days (neonatal period) in a year per 1000 live births in the same year.

• Infant Mortality Rate (IMR): The number of deaths before the age of 1 year (infancy) in a year per 1000 live births in the same year.

• Under Five Mortality Rate (U5MR): Quantifies the probability of dying between birth and age five, per 1000 live births in a given year.

• Child Death Rate (ChDR): Quantifies the probability of dying between the ages of one and five years, per 1000 live births in a given year.


• Cause Specific Mortality Rate (CSMR):

$$\text{CSMR} = \frac{\text{No. of deaths secondary to a given cause in a year}}{\text{Population at risk}} \times 1000$$

• Cause Specific Death Ratio (Proportionate Mortality Ratio):

$$\text{Proportionate Mortality Ratio} = \frac{\text{No. of deaths secondary to a cause in a year}}{\text{Total no. of deaths in the same year}} \times 1000$$

218


• Maternal Mortality Ratio:

$$\text{MMR}_{o} = \frac{\text{Number of maternal deaths in a given year}}{\text{Total number of live births in the same year}} \times 100000$$

• Maternal Mortality Rate:

$$\text{MMR}_{a} = \frac{\text{Number of maternal deaths in a given year}}{\text{Total number of women of reproductive age in the same year}} \times 100000$$

219

Measures of Migration

• Crude In-Migration Rate: Number of in-migrants (I) per 1,000 population in a given year.

• Crude Out-Migration Rate: Number of out-migrants (O) per 1,000 population in a given year.

• Crude Net Migration Rate: Difference between the number of in-migrants (I) and the number of out-migrants (O) per 1,000 population in a given year.

220


Measures of Marriage

• Crude Marriage Rate: Number of marriages (M) per 1000 population in a given year.

• General Marriage Rate: Number of marriages (M) per 1000 population aged 15 and older in a given year.

221

Measure of Population Growth and Projection

• Crude Rate of Natural Increase (r):

$$r = \text{CBR} - \text{CDR}$$

• Population Projection:

$$P_t = P_0 (1 + r)^{t}$$

• Population Doubling Time:

$$t = \frac{\log 2}{\log(1 + r)}$$
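The growth formulas can be tried on made-up numbers. In the sketch below the CBR and CDR are hypothetical rates per 1000, so their difference is divided by 1000 to obtain r as a proportion before it is used in the projection and doubling-time formulas. That conversion is an assumption of this sketch, as are the function names.

```python
from math import log

def natural_increase(cbr, cdr):
    """Rate of natural increase as a proportion, from CBR and CDR per 1000."""
    return (cbr - cdr) / 1000

def project_population(p0, r, t):
    """Population after t years: P_t = P_0 * (1 + r)^t."""
    return p0 * (1 + r) ** t

def doubling_time(r):
    """Years needed for the population to double at annual growth rate r."""
    return log(2) / log(1 + r)

r = natural_increase(cbr=40, cdr=15)  # hypothetical rates, giving r = 0.025
print(r)
print(round(project_population(1_000_000, r, t=10)))
print(round(doubling_time(r), 1))
```

At 2.5% annual growth the doubling time comes out at roughly 28 years.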

222

Health Service Statistics

• Data generated from the health system itself.
• Advantages:
– Gives morbidity information.
– Identifies priority health problems in the area.
– Determines met and unmet health needs.
– Determines the success or failure of specific health care programs.
– Assesses utilization of health services.

223

Health Service Statistics Cont..

• Limitations
– Lack of completeness
– Lack of representativeness of the general community
– Lack of denominators
– Lack of uniformity
– Lack of quality
– Lack of compliance with reporting

224


Health Service Statistics Cont..

1. Relative Frequency of a Disease:

$$\text{Relative frequency of a given disease} = \frac{\text{No. of patients diagnosed with a specific disease}}{\text{Total number of health institution visits}} \times 100\%$$

2. Cure Rate:
• Quantifies the proportion of patients who have been cured of a disease condition using a treatment modality, out of 100 patients who received the same type of treatment.

• The term “Success Rate” can be used if the measured parameter is a procedure.

$$\text{Cure Rate} = \frac{\text{No. of cured patients of a given disease using a treatment modality}}{\text{Number of patients who received the treatment}} \times 100\%$$

225

Health Service Statistics Cont..

3. Admission Rate:
• Quantifies the proportion of admissions among patients who visited the health institution in a given period of time.

$$\text{Admission Rate} = \frac{\text{No. of patients admitted to a health institution}}{\text{Total number of patients who visited the institution}} \times 100\%$$

4. Hospital Death Rate:
• Quantifies the proportion of deaths among hospitalized patients in a given period of time.

$$\text{Hospital Death Rate} = \frac{\text{No. of deaths among hospitalized patients}}{\text{Total no. of admissions}} \times 100\%$$

Health Service Statistics Cont..

5. Bed Occupancy Rate:
• Quantifies the percentage occupancy of hospital beds in a year.

$$\text{BOR} = \frac{\text{Annual number of hospitalized patient days}}{365 \times \text{total number of beds}} \times 100\%$$

6. Average Length of Stay:
• Quantifies the average duration (in days) of stay of hospitalized patients.

$$\text{ALS} = \frac{\text{Annual number of hospitalized patient days}}{\text{Number of discharges or deaths}}$$
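The two hospital utilization formulas can be illustrated with a short sketch. The hospital figures below (100 beds, 29,200 patient days, 4,000 discharges or deaths in the year) are hypothetical, and the function names are illustrative.

```python
def bed_occupancy_rate(patient_days, beds, days_in_year=365):
    """Percentage of available bed days that were actually occupied."""
    return patient_days / (days_in_year * beds) * 100

def average_length_of_stay(patient_days, discharges_or_deaths):
    """Average number of days a hospitalized patient stayed."""
    return patient_days / discharges_or_deaths

# Hypothetical annual figures for one hospital
bor = bed_occupancy_rate(patient_days=29_200, beds=100)
als = average_length_of_stay(patient_days=29_200, discharges_or_deaths=4_000)

print(round(bor, 1), round(als, 1))
```

For these figures the bed occupancy rate is 80% and the average length of stay is 7.3 days.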

Sampling Method


Why Sampling?

• Sampling is that part of statistical practice concerned with the selection of individual observations intended to yield reasonable knowledge about a population of concern, especially for the purposes of statistical inference.

• Study population vs. target (source, reference) population.

• Parameter: A descriptive measure computed from the data of the source population.

• Statistic: A descriptive measure computed from the data of a sample.
• The issues of adequate sample size and a representative sampling technique are important for correct estimation of the parameter using a statistic.

Why Sampling?

230

Why Sampling?

• Researchers rarely survey the entire population, for two reasons:
(1) The cost is too high, and
(2) The population is dynamic.

• Main advantages of sampling:
(1) The cost is lower,
(2) Data collection is faster, and
(3) It is possible to ensure accuracy and quality of the data because the dataset is smaller.

• Main disadvantage of sampling:
– Non-representativeness (sampling error)

231

Sampling
Important terms:
• Sampling Unit: The unit of selection in the sampling process.
• Study Unit: The unit on which information is collected.
• Sampling Frame: The list of all the units in the source population from which a sample is to be taken.
• Sampling Fraction: The ratio of the number of units in the sample to the number of units in the source population (n/N). Its reciprocal (N/n) is the Sampling Interval.

232


Types of Sampling

• Probability Sampling: Every unit in the population has a known, non-zero probability of being sampled, and the process involves random selection.

• Nonprobability Sampling: Any sampling method where some elements of the population have no chance of selection or where the probability of selection can't be accurately determined.

233

Probability Sampling

– Simple Random Sampling (SRS)
– Systematic Random Sampling
– Stratified Sampling
– Cluster Sampling
– Multistage Sampling

234

A. Simple Random Sampling (SRS)
• Is the purest (the most representative) form.

• Each member of the population has an equal, nonzero and known chance of being selected.

• This could be accomplished by writing each study unit's name on a slip of paper and selecting an adequate number of them using the lottery method.

• It can also be done by assigning a number to each sampling unit and then selecting the sample using a table of random numbers or a computer package.

235

How to use a table of random numbers

1. Number each member of the population.
2. Determine population size (N).
3. Determine sample size (n).
4. Determine the starting point in the table by randomly picking a page and dropping your finger on the page with your eyes closed.
5. Choose a direction to read (to the left, right, down or up).
6. Select the first n numbers read from the table whose last digits are between 0 and N.
7. Once a number is chosen, do not use it again.
8. If you reach the end of the table before obtaining your n numbers, pick another starting point, read in a different direction, and continue until done.

236


Simple Random Sampling Cont…

• When a large dataset is available in a database, statistical packages can select a random sample of a given size.

• In SPSS:
– Data > Select Cases > Random > complete the dialogue box accordingly.

• In Excel:
– Tools > Data Analysis > Sampling > complete the dialogue box accordingly.
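Outside SPSS and Excel, a simple random sample can be drawn with a few lines of general-purpose code. The sketch below uses Python's standard `random` module. The sampling frame of 500 numbered units, the sample size of 20 and the seed are hypothetical values chosen for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, each equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(frame, n)

frame = list(range(1, 501))  # units numbered 1..N with N = 500
sample = simple_random_sample(frame, n=20, seed=42)
print(sorted(sample))
```

Because `random.sample` draws without replacement, no unit can appear twice, which mirrors the rule of not reusing a number from the table of random numbers.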

237

Simple Random Sampling Cont…

Limitation of SRS
• Requires a sampling frame,
• Takes a longer time.

238

B. Systematic Random Sampling

• Selects units at a fixed interval throughout the sampling frame after a random start.

• The steps are:
– Number the units in the population from 1 to N,
– Decide on the n (sample size) that you need,
– Calculate the sampling interval k (k = N/n),
– Randomly select an integer between 1 and k,
– Then take every k-th unit.
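The steps of systematic random sampling can be sketched as follows. The population size of 1,000 and sample size of 50 are hypothetical, and the interval k is taken as the whole-number part of N/n, which is an assumption for cases where N is not an exact multiple of n.

```python
import random

def systematic_sample(N, n, seed=None):
    """Return the unit numbers chosen by systematic sampling from units 1..N."""
    k = N // n                   # interval between selected units
    rng = random.Random(seed)
    start = rng.randint(1, k)    # random start between 1 and k, inclusive
    return [start + i * k for i in range(n)]

selected = systematic_sample(N=1000, n=50, seed=3)
print(selected[:5], "...", selected[-1])
```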

239

Systematic Random Sampling Cont...

• Advantage:
– It is easier and less time consuming to perform.
– Occasionally it can be conducted without a sampling frame.

• Disadvantage:
– Can be biased when there is a cyclic pattern in the order of the subjects.

240


C. Stratified Sampling

• Applied when the source population is heterogeneous with respect to a variable of interest.

• The population is first divided into classes (strata).

• Then a separate sample is taken from each stratum using simple or systematic random sampling.

• The number taken from each stratum might be equal (Non-Proportional Stratified Sampling), or it may be determined by the proportion of each class in the source population (Proportional Stratified Sampling).
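Proportional stratified sampling can be sketched in two steps: allocate the total sample across strata in proportion to their size, then draw a simple random sample within each stratum. The strata names, their sizes and the total sample size below are hypothetical. Rounding the allocation can make the total differ slightly from n when the proportions are not whole numbers.

```python
import random

def proportional_allocation(strata_sizes, n):
    """Split a total sample of n across strata in proportion to stratum size."""
    total = sum(strata_sizes.values())
    return {name: round(n * size / total) for name, size in strata_sizes.items()}

def stratified_sample(frames, n, seed=None):
    """Draw a simple random sample from each stratum's own sampling frame."""
    rng = random.Random(seed)
    sizes = {name: len(units) for name, units in frames.items()}
    allocation = proportional_allocation(sizes, n)
    return {name: rng.sample(frames[name], allocation[name]) for name in frames}

# Hypothetical sampling frames, one per stratum
frames = {
    "Kolla": [f"K{i}" for i in range(500)],
    "Woynadega": [f"W{i}" for i in range(300)],
    "Dega": [f"D{i}" for i in range(200)],
}
sample = stratified_sample(frames, n=50, seed=7)
print({name: len(units) for name, units in sample.items()})
```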

241

Stratified Sampling Cont…

• Advantage: Improves representativeness of the sample (Proportional Stratified Sampling) or allows reasonable comparison among strata (Non-Proportional Stratified Sampling).

• Limitation: Requires a separate sampling frame for each stratum.

242

D. Cluster Sampling

• Is a sampling method applied when the source population is composed of “natural” groups.

• Assuming the groups are similar to each other, cluster sampling selects a few groups (clusters) from the population as Primary Sampling Units (PSU).

• Then the required information is collected from all elements, the Secondary Sampling Units (SSU), within each selected group.

243

Cluster Sampling Cont..

• Advantage:
– It doesn’t require a sampling frame of the SSUs.
– Requires less time and fewer resources.

• Disadvantage:
– Relies on the assumption of homogeneity among clusters.
– Less control over sample size.

244


E. Multistage Sampling

• Is like cluster sampling, but involves selecting a sample within each chosen cluster, rather than including all units in the cluster.

• Thus, multistage sampling involves selecting a sample in at least two stages.

• The advantage is that it is simpler to carry out than SRS.

• The disadvantage is that sampling error inflates as the number of stages increases.

245

Probability Proportional to Size Sampling Technique

• PPS is a variant of the cluster sampling technique.
• Useful when the sampling units vary considerably in size.
• The probability of selecting a sampling unit (e.g., village, zone, district, health center) is proportional to the size of its population.

Involves the following procedures:
• List all clusters with their respective source population size and cumulative frequency.
• Decide the number of clusters (a) which will be included in the study.

PPS Cont…

• Decide the number of individuals which will be studied per selection of a cluster (b).

• Divide the total population by the number of clusters to be studied. This will give you the sampling interval (SI).

• Choose a number between 1 and the SI at random. This is the Random Start (RS) point.
• Calculate the following series: RS; RS + SI; RS + 2SI; ...; RS + (a-1)SI.
• Based on the cumulative frequency, identify the clusters in which the selected numbers fall.
• For every selection of a cluster, select b individuals at random from it. Note that if a cluster is selected twice, 2b individuals should be selected at random.
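The PPS procedure can be sketched in code. The cluster names and sizes below are hypothetical, the sampling interval is taken as the whole-number part of the division, and the optional `rs` argument exists only so that a worked case can be reproduced; in practice the random start is drawn at random.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def pps_select(cluster_sizes, a, rs=None, seed=None):
    """Select a clusters with probability proportional to size.

    Returns (sampling interval, random start, list of selected clusters).
    A cluster can appear more than once if it spans several selection points.
    """
    names = list(cluster_sizes)
    cumulative = list(accumulate(cluster_sizes[name] for name in names))
    si = cumulative[-1] // a                        # sampling interval
    if rs is None:
        rs = random.Random(seed).randint(1, si)     # random start in 1..SI
    points = [rs + i * si for i in range(a)]        # RS, RS+SI, ..., RS+(a-1)SI
    # Each point falls in the first cluster whose cumulative size reaches it.
    chosen = [names[bisect_left(cumulative, point)] for point in points]
    return si, rs, chosen

sizes = {"A": 1000, "B": 400, "C": 200, "D": 300, "E": 1200,
         "F": 1000, "G": 1600, "H": 200, "I": 350, "J": 450}
print(pps_select(sizes, a=3, rs=1000))
```

With a random start of 1000, the selection points are 1000, 3233 and 5466, which fall in clusters A, F and G. From each selected cluster, b individuals would then be drawn at random.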

247

2. Nonprobability Sampling

• Here, the sample is less likely to be representative of the population, thus it is difficult to extrapolate from the sample to the population.

• Is used when there is no sampling frame or when it is impossible to conduct probability sampling for economic or feasibility reasons.

248


Nonprobability Sampling Cont..

• Judgmental or Purposive Sampling: The researcher chooses the sample based on who he/she thinks would be appropriate for the study.

• Convenience Sampling: The selection of units from the population is based on availability and/or accessibility.

• Quota Sampling: It starts with systematically setting a “quota” to represent subgroups of a population. Then data are collected to meet the predefined quota.

• Snowball Sampling: The researcher begins by identifying someone who meets the inclusion criteria of the study. That study subject is then asked to recommend others he/she knows who also meet the criteria.

Sampling Error

• Sampling error or estimation error is the part of the total error or uncertainty caused by observing a sample instead of the whole population.

• Non-sampling errors such as non-response and reporting errors may also affect the outcome of a sample-based study.

• Theoretically, it is the value estimated from a sample minus the population value.

• Unlike bias, sampling error can be predicted, calculated, and accounted for.
• There are several measures of sampling error.

250

Sampling Error Cont…

1. Standard error
• Is a measure of the variability of an estimate due to sampling.
• It indicates the extent to which an estimate derived from a sample survey can be expected to deviate from the population value.
• Depends upon the underlying variability in the population for the characteristic as well as the sample size used for the survey.

• The standard error is a foundational measure from which other sampling error measures are derived.

251

Sampling Error Cont…
2. Confidence intervals:
• A range that is expected to contain the population value of the characteristic with a known probability.
3. Margin of error:
• Is a measure of the precision of an estimate at a given level of confidence.
4. Coefficient of variation:
• The relative amount of sampling error in comparison with a sample estimate.
• CV = SE / Estimate × 100%
• There are no hard and fast rules to define an acceptable level.
• The smaller the CV, the more reliable the estimate.
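The measures of sampling error listed here build on one another, which a short sketch can make concrete for a sample mean. The sample values are hypothetical, and the margin of error uses the normal (Z) multiplier, a large-sample assumption; for small samples a t multiplier would normally be preferred.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def sampling_error_summary(sample, confidence=0.95):
    """Standard error, margin of error, confidence interval and CV of a mean."""
    n = len(sample)
    estimate = mean(sample)
    se = stdev(sample) / sqrt(n)                       # standard error of the mean
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * se                                     # margin of error
    cv = se / estimate * 100                            # CV = SE / Estimate x 100%
    return {
        "estimate": estimate,
        "se": se,
        "margin": margin,
        "ci": (estimate - margin, estimate + margin),
        "cv_percent": cv,
    }

glucose = [88, 92, 95, 90, 85, 91, 94, 89, 93, 87]  # hypothetical values, mg/dl
summary = sampling_error_summary(glucose)
for name, value in summary.items():
    print(name, value)
```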

252


Sampling Error Cont…

5. P values:
• The probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.

Importance of such measures:
• To indicate the statistical reliability and usability of estimates.
• To make comparisons between estimates.
• To conduct tests of statistical significance.
• To help users draw appropriate conclusions about data.

253

Exercise 1

• A medical practitioner wanted to assess the quality of the family planning service offered in a hospital. Accordingly, he conducted exit interviews with those women whose ID number is a multiple of five. What sampling method is employed?

254

Exercise 2

• A medical practitioner wanted to assess the prevalence of malnutrition among under-five children in a woreda. Assuming all kebeles in the woreda are similar, he included all under-five children in two randomly selected kebeles.
– What sampling method is employed?
– What possible limitation do you expect?

255

Exercise 3

• A medical practitioner wanted to assess the prevalence of malnutrition among under-five children in a woreda. Assuming the problem is different across the three agro-ecological zones in the woreda, he included children from 2 kebeles each from Kolla, Dega and Woynadega.
– What sampling method is employed?
– What possible limitation do you expect?

256


Exercise 4

• A researcher wanted to study the prevalence of drug addiction among adolescents in Addis Ababa. First he randomly selected Bole sub city. Then he selected woreda 17 at random from all woredas in Bole sub city. Finally he conducted his study in Kebele 19 (after random selection).
– What sampling method is employed?
– What possible limitation do you expect?
– If woreda 17 had been selected because of its proximity to the researcher's organization, what would the sampling method have been?

257

Sampling Distribution and Estimation

Estimation

• Estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample.

• Can be of two types:
 – Point Estimation
 – Interval Estimation

Point Estimate

• Point Estimate: A point estimate of a population parameter is a single value of a statistic.

• The following table gives commonly used point estimators.


Interval Estimate

• An interval estimate is defined by two numbers, between which a population parameter is said to lie.

• For example, $a < \mu < b$ is an interval estimate of the population mean $\mu$.

• i.e. the population mean is greater than a but less than b.

• An interval estimate has got three components (concepts).

261

Interval Estimate Cont….

• An interval estimate has got three components (concepts):
 – A statistic: (the point estimator)
 – A margin of error: (the measure of precision)
 – A confidence level: (the measure of uncertainty)

• The interval estimate of a given confidence level is defined by the sample statistic ± margin of error.

• An interval estimate is preferred to a point estimate as it considers the precision and uncertainty of estimation.

262


Margin of Error

• In a confidence interval, the range of values above and below the sample statistic is called the margin of error.

• It measures the precision of a sampling method.

• It is a function of the confidence level and of another quantity called the standard error.

263


• Confidence Level
 – The probability part of the interval.
 – It describes how strongly we believe that a particular sampling method will produce an interval that includes the true population parameter.
 – 90%, 95% and 99% confidence intervals are commonly used.
 – For example, 95% CI means: if we used the same sampling method to select different samples and computed an interval estimate from each, the true population mean would fall within the range defined by the sample statistic ± margin of error for about 95% of those samples.



• Example 6.1:
 – A local newspaper conducts an election survey and reports that the independent candidate will receive 30% of the vote. The newspaper states that the survey had a 5% margin of error and a confidence level of 95%.
 – Meaning: We are 95% confident that the independent candidate will receive between 25% and 35% of the vote.

265

CI for a single mean

• Background Concept: Sampling Distribution of Means.
 – One can generate the sampling distribution of means in the following manner:
 – Obtain a sample of n observations selected completely at random from a large population. Determine their mean and then replace the observations in the population.
 – Repeat the sampling procedure indefinitely.
 – The result is a series of means of samples of size n.
 – If each mean in the series is now treated as an individual observation and arranged in a frequency distribution, one comes up with the sampling distribution of means of samples of size n.

CI for a single mean cont..

• The sampling distribution of means has the following properties:

1. The mean of the sampling distribution of means is the same as the population mean.

2. The SD of the sampling distribution of means (which is called the standard error of the mean) is:

$$\sigma_{\bar{x}} = \sigma / \sqrt{n}$$

3. The sampling distribution of means is approximately a normal distribution, regardless of the original distribution, provided n is large. (Central Limit Theorem)

267


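• The general formula is:

• CI = Sample statistic ± Z value × SE

$$\Pr\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$$

$$\Pr\left[\bar{X} - 1.96(\sigma/\sqrt{n}) \le \mu \le \bar{X} + 1.96(\sigma/\sqrt{n})\right] = 0.95$$

$$95\%\ \text{CI for } \mu = \bar{X} \pm 1.96(\sigma/\sqrt{n})$$

$$\text{CI for } \mu = \bar{X} \pm Z_{\alpha/2}(\sigma/\sqrt{n})$$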



CI for a single mean cont..

• However, when the population variance is unknown and the sample size is less than 30:
 – The sample variance should replace the population variance.
 – Student's t distribution should be used in place of the standard normal distribution.
 – Hence the formula would be (s is the sample standard deviation):

$$\mu = \bar{X} \pm t_{\alpha/2,\,(n-1)}\,(s/\sqrt{n})$$

269


Example 6.2:
• The mean blood glucose level of 100 randomly selected healthy adults is 85 mg/dl. Find the 95% CI for the mean blood glucose level for all healthy adults (µ), given that the standard deviation for the population is 15 mg/dl.
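A minimal sketch of how Example 6.2 could be worked with the formula CI = X̄ ± Z(σ/√n). The function name `mean_ci` is only illustrative and is not part of the lecture; the critical value is taken from Python's standard-library normal distribution rather than from a table.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(xbar, sigma, n, confidence=0.95):
    """Z-based confidence interval for a mean (population SD known)."""
    # Two-sided critical value, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)  # Z value x standard error
    return xbar - margin, xbar + margin

# Example 6.2: n = 100, sample mean 85 mg/dl, population SD 15 mg/dl
low, high = mean_ci(xbar=85, sigma=15, n=100)
print(f"95% CI for the mean: {low:.2f} to {high:.2f} mg/dl")
```

With these inputs the interval is roughly 82.1 to 87.9 mg/dl.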

270

CI for difference between two means

• Background Concept: The Sampling Distribution of Difference of Means.
 – Consider two different populations X and Y.
 – The first population has mean of $\mu_x$ and standard deviation of $\sigma_x$.
 – The second population has mean of $\mu_y$ and standard deviation of $\sigma_y$.
 – From the first population take a sample of size $n_x$ and compute its mean $\bar{X}$.
 – From the second population take a sample of size $n_y$ and compute its mean $\bar{Y}$.
 – Then determine $\bar{X} - \bar{Y}$.

CI for difference between two means cont…

• Do the same for all pairs of samples that can be chosen independently from the two populations.

• The differences $\bar{X} - \bar{Y}$ are a new set of scores which form the sampling distribution of differences of means.

CI for difference between two means cont…

• Properties of the sampling distribution of differences of means:

1. The mean of the sampling distribution of differences of means equals the difference of the population means ($\mu_1 - \mu_2$).

2. The SD of the sampling distribution of differences of means (SE) is equal to:

$$\sigma_{(\bar{X}-\bar{Y})} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

3. The distribution is approximately normal.

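CI for difference between two means cont…

$$\Pr\left(-1.96 < \frac{(\bar{X}-\bar{Y}) - (\mu_1-\mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} < 1.96\right) = 0.95$$

$$\Pr\left[(\bar{X}-\bar{Y}) - 1.96\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} \le \mu_1-\mu_2 \le (\bar{X}-\bar{Y}) + 1.96\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}\right] = 0.95$$

$$95\%\ \text{CI of } \mu_1-\mu_2 = (\bar{X}-\bar{Y}) \pm 1.96\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$$

$$\mu_1-\mu_2 = (\bar{X}-\bar{Y}) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$$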

CI for difference between two means cont….

Example 6.3:
• A random sample of 120 HIV patients who were on ART had lived on average for 25 years, with SD of 5 years, since their diagnosis was made. Similarly, a random sample of 140 HIV patients who were not on ART had lived on average for 14 years with SD of 4 years.

• Calculate the point estimate for the difference between the population means.

• Find the 95% CI for the difference between the means.
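One possible way to check Example 6.3 numerically is sketched below. It assumes, as the large samples suggest, that the sample SDs can stand in for the population SDs in the SE formula; `diff_means_ci` is an illustrative name only.

```python
from math import sqrt
from statistics import NormalDist

def diff_means_ci(x1, s1, n1, x2, s2, n2, confidence=0.95):
    """Point estimate and Z-based CI for the difference of two means."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(s1**2 / n1 + s2**2 / n2)  # SE of the difference of means
    diff = x1 - x2                      # point estimate
    return diff, diff - z * se, diff + z * se

# Example 6.3: on ART (n=120, mean 25, SD 5) vs not on ART (n=140, mean 14, SD 4)
diff, low, high = diff_means_ci(25, 5, 120, 14, 4, 140)
print(f"Point estimate: {diff} years; 95% CI: {low:.2f} to {high:.2f} years")
```

The point estimate is 11 years, and the interval comes out at roughly 9.9 to 12.1 years.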

275

CI for single proportion

• Background Concept: The Sampling Distribution of Proportions

• Here we are interested in the proportion of the population that has a certain characteristic, represented by P or $\pi$.

• If we take random samples of n observations indefinitely and calculate the sample proportion p for each sample, then we will have the sampling distribution of proportions.

• The sampling distribution of proportions has the following characteristics:


CI for single proportion cont…

• The sampling distribution of proportions has the following properties:

1. The mean of the sampling distribution of proportions = $\pi$.
2. The SD (SE) of the sampling distribution of proportions:

$$\sigma_P = \sqrt{\frac{P(1-P)}{n}}$$

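CI for single proportion cont..

$$\Pr\left(-1.96 < \frac{p - \pi}{\sqrt{\frac{P(1-P)}{n}}} < 1.96\right) = 0.95$$

$$\Pr\left[p - 1.96\sqrt{\frac{P(1-P)}{n}} \le \pi \le p + 1.96\sqrt{\frac{P(1-P)}{n}}\right] = 0.95$$

$$95\%\ \text{CI for } \pi = p \pm 1.96\sqrt{\frac{P(1-P)}{n}}$$

$$\pi = p \pm Z_{\alpha/2}\sqrt{\frac{P(1-P)}{n}}$$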

CI for single proportion cont..

Example 6.4:
• In Addis Ababa, a blood test of 120 randomly selected commercial sex workers revealed that 30 of them are HIV positive. What will be the 99% confidence interval of HIV/AIDS prevalence for all commercial sex workers in the city?
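A short sketch for Example 6.4 follows. It assumes the unknown P in the standard error is replaced by the sample proportion p, which is the usual practice when only sample data are available; `proportion_ci` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(events, n, confidence=0.95):
    """Z-based CI for a single proportion, using p in place of P in the SE."""
    p = events / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 2.58 for 99%
    se = sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

# Example 6.4: 30 HIV-positive results among 120 sampled sex workers, 99% CI
p, low, high = proportion_ci(events=30, n=120, confidence=0.99)
print(f"p = {p:.2f}; 99% CI: {low:.3f} to {high:.3f}")
```

Under these assumptions the prevalence estimate is 25%, with a 99% CI of roughly 15% to 35%.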

279

CI for difference between two proportions

• Consider two different populations X and Y.
• The first population has proportion of $\pi_x$ and the second population has proportion of $\pi_y$.
• From the first population take a sample of size $n_x$ and compute its sample proportion $p_x$. From the second population take a sample of size $n_y$ and compute its sample proportion $p_y$.
• Then determine $p_x - p_y$.
• Do this for all pairs of samples that can be chosen independently from the two populations.
• The differences $p_x - p_y$ are a new set of scores which form the sampling distribution of differences of proportions.

CI for difference between two proportions cont…

• The sampling distribution of differences of proportions has the following properties:

1. The mean of the sampling distribution of differences of proportions equals the difference of the population proportions ($\pi_1 - \pi_2$).

2. The SD (SE) is given as:

$$\sigma_{(p_1-p_2)} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

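CI for difference between two proportions cont…

$$\Pr\left(-1.96 < \frac{(p_1-p_2) - (\pi_1-\pi_2)}{\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}} < 1.96\right) = 0.95$$

$$\Pr\left[(p_1-p_2) - 1.96\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}} \le \pi_1-\pi_2 \le (p_1-p_2) + 1.96\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}\right] = 0.95$$

$$95\%\ \text{CI for } \pi_1-\pi_2 = (p_1-p_2) \pm 1.96\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}$$

$$\pi_1-\pi_2 = (p_1-p_2) \pm Z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}$$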

CI for difference between two proportions cont…

• Example 6.5:
• Among 200 randomly selected illiterate married women, 50 use contraceptives. Similarly, among 300 randomly selected married women who can read and write, 150 use contraceptives.

• Calculate the point estimate for the difference between the population proportions.

• Find the 95% CI for the difference between the proportions.
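Example 6.5 can be checked with a few lines of Python. This is only a sketch of the formula for the CI of a difference of proportions; the function name `diff_proportions_ci` is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def diff_proportions_ci(x1, n1, x2, n2, confidence=0.95):
    """Point estimate and Z-based CI for p1 - p2 (unpooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se

# Example 6.5: 50/200 illiterate women vs 150/300 literate women use contraceptives
diff, low, high = diff_proportions_ci(50, 200, 150, 300)
print(f"Point estimate: {diff:.2f}; 95% CI: {low:.3f} to {high:.3f}")
```

The point estimate is -0.25 (25% versus 50%), and the whole interval lies below zero, roughly -0.33 to -0.17.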

283

CI for OR and RR

• When the intention of measurement of association is to make an inference about a population parameter, the CI for OR or RR can be calculated using the following formulas.

• Why do we need the natural logarithm here?

$$\text{CI for OR} = \exp\left[\ln(OR) \pm Z_{\alpha/2}\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\right]$$

$$\text{CI for RR} = \exp\left[\ln(RR) \pm Z_{\alpha/2}\sqrt{\frac{1-\frac{a}{a+b}}{a}+\frac{1-\frac{c}{c+d}}{c}}\right]$$
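The two formulas can be sketched in code as below. The slides do not define the cell labels, so this sketch assumes the usual 2×2 layout: a = exposed with the outcome, b = exposed without, c = unexposed with the outcome, d = unexposed without. The counts used are hypothetical and chosen only to make the example run.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def or_rr_ci(a, b, c, d, confidence=0.95):
    """Log-scale CIs for the odds ratio and the relative risk of a 2x2 table."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    odds_ratio = (a * d) / (b * c)
    risk_ratio = (a / (a + b)) / (c / (c + d))
    # Standard errors are computed on the natural-log scale
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    se_log_rr = sqrt((1 - a / (a + b)) / a + (1 - c / (c + d)) / c)
    or_ci = (exp(log(odds_ratio) - z * se_log_or),
             exp(log(odds_ratio) + z * se_log_or))
    rr_ci = (exp(log(risk_ratio) - z * se_log_rr),
             exp(log(risk_ratio) + z * se_log_rr))
    return odds_ratio, or_ci, risk_ratio, rr_ci

# Hypothetical table: 20 of 100 exposed and 10 of 100 unexposed have the outcome
or_est, or_ci, rr_est, rr_ci = or_rr_ci(a=20, b=80, c=10, d=90)
print(f"OR = {or_est:.2f}, 95% CI {or_ci[0]:.2f} to {or_ci[1]:.2f}")
print(f"RR = {rr_est:.2f}, 95% CI {rr_ci[0]:.2f} to {rr_ci[1]:.2f}")
```

Working on the log scale and exponentiating back keeps both limits positive and gives an interval that is not symmetric around the estimate, which bears on the question posed above.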


CI for OR and RR Cont..

• SPSS can compute OR and RR with their confidence intervals, given the information is fed in the following manner.
• Create 3 variables in the variable view page:
 – Frequency (for the four cells),
 – Exposure (0 as Yes, 1 as No) and
 – Outcome (0 as Yes, 1 as No)
• Enter the values into the data view page as mentioned above.
• Weight cases based on the “frequency” variable.
• Do the analysis in the following manner:
 – Descriptive statistics > Crosstabs > Put “exposure” as row and “outcome” as column > Statistics > Check “risk” > Continue > OK
 – OR is given as “Odds ratio for exposure (yes/no)”
 – RR is given as “For cohort disease = yes”

Unbiased and Biased Estimators

• A statistic is called an unbiased estimator of a population parameter if the mean of the sampling distribution of the statistic is equal to the value of the parameter.

• Because the mean of the sampling distribution of means equals the population mean, the sample mean is an unbiased estimator of the population mean.

• If the mean value of an estimator is either less than or greater than the true value of the quantity it estimates, then the estimator is called biased.

• A case of biased estimation is seen to occur when the sample variance is used to estimate the population variance using the following formula:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n}$$

286

Unbiased and Biased Estimators Cont…

• The sample variance calculated using this formula (with n as the denominator) is, on average, less than the true population variance.

• This is because sample observations tend to be closer to their own sample mean than to the population mean.

• To compensate for this, n-1 is used as the denominator.
• It is important to note that, even using n-1 as the denominator, the sample standard deviation still remains a biased estimator of the population standard deviation, but for large sample sizes this bias is negligible.

287

Estimation of Sample Size for Cross Sectional Studies

Why we need to calculate sample size:

• Representativeness vs cost.
• Estimation can be made based on a given confidence level and standard error.

Sample Size to Estimate a Single Population Proportion

• If the main objective of the study is to estimate a single population proportion, then the sample size can be determined using the formula:

$$n = \frac{Z_{\alpha/2}^{2}\,P(1-P)}{d^{2}}$$

Where:
 – n is the minimum sample size required for a very large population (≥ 100,000)
 – Z is the critical value for a given confidence level
 – P is the expected proportion of the event to be studied (to be estimated based on findings of previous studies)
 – d is the margin of error

Sample Size to Estimate a Single Population Proportion Cont…

NB:
• If p is not known it has to be taken as 0.5. (Why?)
• Depending on the nature of the study, a 10-15% contingency should be added.
• If the size of the population is less than 100,000 the sample size should be corrected using the formula:

$$\text{Corrected sample size} = \frac{n \times N}{n + N}$$

• Where:
 – n is the non-corrected sample size
 – N is the size of the source population

290

Sample Size to Estimate a Single Population Proportion Cont…

Example 6.6:
• A researcher is interested in determining the prevalence of family planning use in Addis Ababa city. A previous study indicates the prevalence is around 55%. If the researcher wants to determine the sample size with 95% CI and 5% margin of error, what number of women of reproductive age should be included in his study?
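A sketch of the sample size calculation for Example 6.6, together with the finite population correction given earlier. The function names and the source population of 5,000 used in the second call are hypothetical and serve only to illustrate the correction.

```python
from math import ceil
from statistics import NormalDist

def sample_size_proportion(p, d, confidence=0.95):
    """Minimum n to estimate a proportion p within margin of error d."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z**2 * p * (1 - p) / d**2

def corrected_sample_size(n, N):
    """Finite population correction: n x N / (n + N)."""
    return n * N / (n + N)

# Example 6.6: expected prevalence 55%, 95% confidence, 5% margin of error
n_raw = sample_size_proportion(p=0.55, d=0.05)
n_required = ceil(n_raw)  # always round up to a whole participant
print(f"n = {n_raw:.1f}, so at least {n_required} women")

# Hypothetical source population of 5,000 (not part of the example)
n_corrected = corrected_sample_size(n_raw, N=5000)
print(f"Corrected for N = 5000: {ceil(n_corrected)}")
```

The uncorrected calculation gives about 380.3, so 381 women before any contingency is added.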

291

Sample Size to Estimate Single Population Mean

• If the main objective of the study is to estimate a single population mean, then the sample size can be determined using the formula:

$$n = \left(\frac{Z_{\alpha/2}\,\sigma}{d}\right)^{2}$$

• Where:
 – n is the minimum sample size required for a large population
 – Z is the critical value for a given confidence level
 – $\sigma$ is the expected SD of the event to be studied
 – d is the margin of error

Sample Size to Estimate Single Population Mean

Example 6.7:
• A researcher is interested in determining the mean blood glucose level among high school students. A previous study indicates the mean is 85 mg/dl with standard deviation of 15 mg/dl. If the researcher wants to determine the sample size with 95% CI and tolerates a 2 mg/dl margin of error, what number of students should be included in his study?
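Example 6.7 follows the same pattern with the formula for a mean. The sketch below uses an illustrative function name, `sample_size_mean`, and takes the critical value from the standard normal distribution.

```python
from math import ceil
from statistics import NormalDist

def sample_size_mean(sigma, d, confidence=0.95):
    """Minimum n to estimate a mean within margin of error d, given expected SD."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (z * sigma / d) ** 2

# Example 6.7: expected SD 15 mg/dl, margin of error 2 mg/dl, 95% confidence
n_raw = sample_size_mean(sigma=15, d=2)
n_required = ceil(n_raw)
print(f"n = {n_raw:.1f}, so at least {n_required} students")
```

The previous study's mean of 85 mg/dl does not enter the calculation; only the SD and the tolerated margin of error do.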

293

Hypothesis Testing

What is a Hypothesis

• A statistical hypothesis is an assumption or a statement, which may or may not be true, concerning one or more populations.

• Setting up and testing hypotheses is an essential part of statistical inference.

• Examples of statistical hypotheses:
 – The mean pulse rate among AAU-HI students is 72/min.
 – The prevalence of HIV in AA is 12%.
 – The mean blood glucose level among Chinese and Indians is the same.
 – The prevalence of hypertension in US and UK is the same.
 – The mean blood cholesterol level is the same before and after taking a drug.

Steps in Hypothesis Testing

Hypothesis testing involves the following steps:
1. Choose the hypothesis to be tested.
2. Choose an alternative hypothesis which would be accepted if the first hypothesis is rejected.
3. Decide on the appropriate test statistic for the hypothesis (Z, t, $\chi^2$).
4. Decide the level of significance and corresponding critical value.
5. Obtain the value of the test statistic.
6. Make a decision and interpret it.

The Null and Alternative Hypothesis

• In hypothesis testing two hypotheses are involved: the Null Hypothesis and the Alternative Hypothesis.

• Every hypothesis test requires the analyst to state a null hypothesis and an alternative hypothesis.

• They are mutually exclusive and complementary events.

• Both hypotheses are about the parameter, not about the statistic.

• The null hypothesis (H0 or HN):
 – The first hypothesis to be set by the researcher.
 – It commonly implies “equals to”, “no effect”, “no difference” or “no association” conclusions.

The Null and Alternative Hypothesis Cont..

Example:
• The mean pulse rate among AAU-HI students is 72/min.
• Drug A has no effect on the blood glucose level of diabetic patients.
• There is no difference in the prevalence of malaria in region A and region B.
• There is no association between smoking and lung cancer.

298

The Null and Alternative Hypothesis Cont..

• The alternative hypothesis (HA or H1):
• The hypothesis that will be accepted if H0 is rejected.
• Implies conclusions like “is not equal”, “has effect”, “there is difference” and “there is association”.

Example:

• The mean pulse rate among AAU-HI students is not equal to 72/min.
• Drug A has an effect on the blood glucose level of diabetic patients.
• There is a difference in the prevalence of malaria in region A and B.
• There is an association between smoking and lung cancer.

299

Test Statistic

• In hypothesis testing we accept or reject the hypothesis by calculating the probability of getting the estimated sample value, given that the hypothesized population value is true.

• If the probability is very low we reject the null hypothesis.
• The probability is calculated using a test statistic.
• The most commonly used test statistics are the Z, Student's t and $\chi^2$ tests.
• The general formula to calculate a test statistic is:

$$\text{test statistic} = \frac{(\text{estimate}) - (\text{hypothesized value})}{SE}$$

Test Statistic

Student's t Distribution:
• The use of the z-test requires knowledge of the variance of the population from which the sample is taken.
• It is somewhat strange that one can have knowledge of the population variance and not know the value of the population mean.

• In statistics, as long as the sample size is large enough, most datasets can be explained by the standard normal distribution.

• But when the sample size is small and the population SD is not known, statisticians rely on the distribution of the t statistic.

301

Test Statistic Cont…

• Student's t distribution was developed by William Gosset (1876-1937) under the pseudonym “Student”.

• There are many different t distributions. (The t distribution is a family of distributions.)

• The particular form of the t distribution is determined by its Degrees of Freedom (df).

• The degrees of freedom (df) refers to the number of independent observations in a dataset after some restriction is made.

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

302

Test Statistic Cont…

• The t distribution has the following properties:
 – The mean of the distribution is equal to 0.
 – It is symmetrical about the mean.
 – The variance is equal to v / (v - 2), where v is the df (i.e. v > 2). In general the variance is greater than 1, but approaches 1 as the sample size becomes large.
 – It extends from – infinity to + infinity.
 – Compared to the normal distribution, the t distribution is less peaked in the center and has higher tails.
 – The t distribution approaches the normal distribution as n-1 approaches infinity.


Test Statistic Cont…

• For the t distribution to apply strictly we need the following two assumptions:

1. The observations are selected at random from the population.

2. The population distribution is normal.

• The second assumption need not be met exactly, as the t test is robust to moderate departures from the normal distribution.

• That means even when assumption 2 is not satisfied, the probabilities calculated from the t table are still approximately correct.

305

Chi Square Distribution ($\chi^2$):
• Mainly developed by Karl Pearson (1857-1936).
• A type of probability distribution like Z or t.
• Represented by the Greek letter Chi ($\chi$).
• It is the distribution of the sum of the squared values of the observations drawn from the N(0,1) distribution.
• Let {X1, X2, ..., Xn} be n independent random variables, all ~ N(0,1).

• Then $\chi^2_n$ is defined as the distribution of the sum X1² + X2² + ... + Xn².

306


• Mainly used to check association between two categorical variables.

• It is the most frequently used statistical technique for analysis of count or frequency data.

• It is not a single distribution but rather a family of distributions, indexed by the df.

• The mathematical formula of the $\chi^2$ distribution with k degrees of freedom is given as (where x > 0):

$$Y = \frac{1}{\left(\frac{k}{2}-1\right)!}\,\frac{1}{2^{k/2}}\,x^{(k/2)-1}\,e^{-x/2}$$

307


• The graph is given as:

Test Statistic Cont…

• The formula for the test statistic which approximates the $\chi^2$ distribution is (where O is the observed frequency and E is the expected frequency):

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

• It has the following characteristics:
 – Extends indefinitely to the right from 0.
 – Has only one tail.
 – As the df increases, the chi-square curve approaches a normal distribution.

Errors in Hypothesis Testing

• In testing a hypothesis, two types of errors can be committed: Type I and Type II errors.

• The probability of committing a type I error is denoted as α. It is also called the level of significance (1 - confidence level).
• The probability of committing a type II error is denoted as β (1 - power of the study).

                     Decision of the hypothesis testing
Null Hypothesis      Accept H0          Reject H0
H0 True              Correct            Type I error
H0 False             Type II error      Correct

One and Two Tailed Hypothesis

• Some hypotheses test whether one value is different from another or not, without additionally predicting which will be higher: non-directional or two-tailed test.

• At times some hypotheses test not only the difference of one value from the other but also the direction of the difference, i.e. whether it would be lower or higher: directional or one-tailed test.

Level of Significance, Critical Values and Critical Area

• In practice, the level of significance (α) is chosen arbitrarily.
• Three levels are common: 0.01, 0.05 or 0.10 (depending on the confidence level).
• The smaller the level of significance, the more stringent the hypothesis test, i.e. the stronger the evidence required to reject the null hypothesis.
• The level of significance determines the values of the test statistic that would cause us to reject the hypothesis.
• The corresponding test statistic values for the level of significance are called the Critical Values.
• In a probability distribution the area which is left to the extreme right or/and left of the critical value is called the Critical Area (Rejection Area).
• The area between the two critical values is called the Acceptance Area.


• A level of significance has different critical values for one and two tailed tests.

• A level of significance of 0.05 has a critical value of ±1.96 if the test is two tailed.

• However, if the test is one tailed the critical value would be 1.64 (more precisely 1.645) in whichever tail is being tested.

• Note that critical values for a given level of significance differ depending on the test statistic intended to be used.

α (level of significance)   Two tailed test   One tailed test, <   One tailed test, >
0.10                        ±1.64             -1.28                1.28
0.05                        ±1.96             -1.64                1.64
0.01                        ±2.58             -2.33                2.33

Interpretation and Conclusion

• Interpretation is made based on comparisons between:
 – Calculated test statistic vs critical value.
 – P value vs significance level.

• The conclusion (i.e. accepting or rejecting the null hypothesis) should be made at the given level of confidence.

Test of Hypothesis about Single Population Mean

• Shows how to test the null hypothesis that the population mean is equal to some hypothesized value.

• One begins with a statement that claims a particular value for the unknown population mean.

• The hypothesis testing for a single population mean either accepts or rejects this statement.

• The Z test and the t test are used:
 – Sample > 30: Z test
 – Sample < 30 and population SD known: Z test
 – Sample < 30 and population SD unknown: t test

Test of Hypothesis about Single Population Mean Cont..

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$


Example 7.1:
• Researchers are interested in the mean level of an enzyme in a certain population. They take a sample of 36 individuals, determine the level of enzyme in each and compute a sample mean of 22. It is known that the variable of interest is approximately normally distributed with a standard deviation of 10. Let's say that they are asking the following question: Can we conclude that the mean enzyme level in this population is different from 25?

323


• Step 1 and 2: Define the H0 and H1:

$$H_0: \mu = 25 \qquad H_1: \mu \ne 25$$

• Step 3: Decide appropriate test statistic:
 – Z test

• Step 4: Decide the level of significance and critical value:
 – α value of 0.05.
 – ±1.96 is the critical value.

• Step 5: Obtain the value of the test statistic:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{22 - 25}{10/\sqrt{36}} = \frac{-3}{1.67} = -1.80$$

• Step 6: Make a decision and interpret it.
 – Accept the H0 at 95% confidence level:
 – -1.80 is within the acceptance region (between -1.96 and +1.96).
 – The one-tail P value of 0.036 is greater than the α/2 value of 0.025.
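The six steps of Example 7.1 can be sketched as a short script. The variable names are illustrative; the two-sided P value is computed directly from the standard normal distribution, which is equivalent to comparing the one-tail area with α/2.

```python
from math import sqrt
from statistics import NormalDist

# Example 7.1: n = 36, sample mean 22, known SD 10, H0: mu = 25
xbar, mu0, sigma, n = 22, 25, 10, 36
alpha = 0.05

z = (xbar - mu0) / (sigma / sqrt(n))                 # test statistic
z_critical = NormalDist().inv_cdf(1 - alpha / 2)     # about 1.96
p_two_sided = 2 * NormalDist().cdf(-abs(z))          # area in both tails

reject_h0 = abs(z) > z_critical
print(f"Z = {z:.2f}, critical value = ±{z_critical:.2f}, P = {p_two_sided:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The two-sided P value is about 0.072, which exceeds 0.05, so the conclusion matches the comparison of Z with ±1.96.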

326


Example 7.2:
• The researchers mentioned in example 7.1, instead of asking if they could conclude that µ ≠ 25, asked: Can we conclude that the mean enzyme level in this population is less than 25?

Solution:
• Step 1 and 2: Define the H0 and H1:

$$H_0: \mu \ge 25 \qquad H_1: \mu < 25$$

327



• Step 4: Decide the level of significance and critical value:
 – α value of 0.05.
 – -1.645 is the critical value (one tailed test, lower tail).
• Step 5: Obtain the value of the test statistic:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{22 - 25}{10/\sqrt{36}} = \frac{-3}{1.67} = -1.80$$

• Step 6: Make a decision and interpret it.
 – Reject the H0 ($\mu \ge 25$) at 95% confidence level.
 – The test statistic -1.80 is within the rejection region (it is less than -1.645).
 – The P value of 0.036 is less than the α value of 0.05.

329

Example 7.3:
• Serum amylase level determination was made on a sample of 15 apparently healthy subjects. The sample yielded a mean of 96 units/100 ml and a standard deviation of 35 units/100 ml. The variance of the population was unknown. We want to know whether we can conclude that the mean of the population is different from 120 units/100 ml.

330


• Step 1 and 2: Define the H0 and H1.

$$H_0: \mu = 120 \qquad H_1: \mu \ne 120$$

• Step 3: Decide appropriate test statistic.
 – t test

• Step 4: Decide level of significance and critical value.
 – α value of 0.05.
 – t value for α/2 of 0.025 at df of 14: ±2.145

• Step 5: Obtain the value of the test statistic.

$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{96 - 120}{35/\sqrt{15}} = -2.65$$

331


• Step 6: Make a decision and interpret it.
• We reject the null hypothesis b/c:
 – The calculated test statistic -2.65 is in the rejection area.
 – The corresponding one-tail P value for -2.65 (b/n 0.005 and 0.01) is less than the α/2 value of 0.025.
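Example 7.3 can be reproduced as follows. The Python standard library has no t distribution, so this sketch takes the critical value ±2.145 (df = 14, α/2 = 0.025) from the example itself, as one would from a t table.

```python
from math import sqrt

# Example 7.3: n = 15, sample mean 96, sample SD 35, H0: mu = 120
xbar, mu0, s, n = 96, 120, 35, 15
t_critical = 2.145  # table value for alpha/2 = 0.025 at df = n - 1 = 14

t = (xbar - mu0) / (s / sqrt(n))  # one-sample t statistic
reject_h0 = abs(t) > t_critical

print(f"t = {t:.3f}, df = {n - 1}, critical value = ±{t_critical}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The unrounded statistic is about -2.656, which the slide reports as -2.65.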

Testing of Hypothesis about Two Population Means

• Compares the difference between two population means.
• H0: there is no difference between the two means.
• H1: there is a difference between the two means.
• Z or t test can be employed.

• Sum up the sample sizes of the two groups: if the total is greater than 30 use the Z test, if less than 30 use the t test.

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$

333

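• The t test is carried out with df of n1 + n2 - 2:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S^2}{n_1} + \frac{S^2}{n_2}}}$$

where S is the pooled standard deviation:

$$S = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$$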

334


Example 7.4:
• A researcher wants to check whether the systolic blood pressure among males is different from that among females or not. Among 50 male samples the mean SBP was 100 mmHg with standard deviation of 5 mmHg. Among 60 females, the mean SBP was 104 mmHg with standard deviation of 10 mmHg. Is there a significant difference between the two means?

335


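• Step 1 and 2: Define the H0 and H1:

$$H_0: \mu_m = \mu_f \qquad H_1: \mu_m \ne \mu_f$$

• Step 4: Decide the level of significance and critical value:
 – α value of 0.05.
 – ±1.96 is the critical value.

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{100 - 104}{\sqrt{\frac{5^2}{50} + \frac{10^2}{60}}} = \frac{-4}{\sqrt{0.5 + 1.67}} = \frac{-4}{1.47} = -2.72$$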

• Step 6: Make a decision and interpret it.
 – We reject the H0 and accept the H1 ($\mu_m \ne \mu_f$) at 95% confidence level b/c:
 – The calculated test statistic -2.72 is in the rejection region.
 – The corresponding one-tail P value for -2.72 (0.0033) is less than the α/2 value of 0.025.

337

Example 7.5:
• Serum amylase determination was made on a sample of 15 apparently healthy subjects and 12 hospitalized subjects. Among healthy subjects, the mean was 96 units/100 ml with standard deviation of 35 units/100 ml. Among hospitalized patients, the mean was 120 units/100 ml with standard deviation of 40 units/100 ml. Is there a significant difference between the two mean values?

338

Testing of Hypothesis about Two Population Means Cont…

• Step 1 and 2: Define the H0 and H1:

$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \ne \mu_2$$

• Step 3: Decide appropriate test statistic.
 – t test

• Step 4: Decide level of significance and critical value.
 – α value of 0.01.
 – t value for α/2 of 0.005 at df of 25: ±2.787

• Step 5: Obtain the value of the test statistic:

$$S = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(14)(35^2) + (11)(40^2)}{25}} = \sqrt{\frac{17150 + 17600}{25}} = \sqrt{1390} = 37.3$$

339


$$t = \frac{96 - 120}{\sqrt{\frac{37.3^2}{15} + \frac{37.3^2}{12}}} = \frac{-24}{14.4} = -1.67$$

• Step 6: Make a decision and interpret it.
• We accept the null hypothesis (at 99% confidence level) b/c:
 – The calculated test statistic -1.67 is in the acceptance region.
 – The corresponding one-tail P value for -1.67 (which is b/n 0.05 and 0.1) is greater than the α/2 value of 0.005.
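The pooled two-sample t test of Example 7.5 can be sketched as below. As in the example, the critical value ±2.787 (df = 25, α/2 = 0.005) is taken from a t table rather than computed; the variable names are illustrative.

```python
from math import sqrt

# Example 7.5: healthy (n=15, mean 96, SD 35) vs hospitalized (n=12, mean 120, SD 40)
n1, x1, s1 = 15, 96, 35
n2, x2, s2 = 12, 120, 40
t_critical = 2.787  # table value for alpha/2 = 0.005 at df = n1 + n2 - 2 = 25

# Pooled variance weights each sample variance by its degrees of freedom
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = sqrt(pooled_var / n1 + pooled_var / n2)
t = (x1 - x2) / se
reject_h0 = abs(t) > t_critical

print(f"Pooled SD = {sqrt(pooled_var):.1f}, t = {t:.2f}, df = {n1 + n2 - 2}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Without intermediate rounding the statistic is about -1.66; the slide's -1.67 comes from rounding the denominator to 14.4. The decision is the same either way.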

Paired t test for difference between two means:
• Every observation in one sample has one matching observation in the second sample.
• Commonly used in evaluation of interventions like new treatment modalities.
• Hence pre and post intervention (treatment) results are compared.
• Usually the t test is used since the individuals involved in the trial are few.

• The null hypothesis: there is no significant difference between the two tests.

341

• Procedures of hypothesis testing are the same, except for the formula for the test statistic calculation:

$$t = \frac{\bar{d}}{SD/\sqrt{n}}$$

 – $\bar{d}$ = the mean of the differences between the two samples.
 – SD = the standard deviation of the differences between the two samples.
 – n = the number of paired cases.

• Note that the calculated test statistic is compared at degrees of freedom of n-1.

342


Example 7.6:
• A random sample of 10 young men was taken and the pulse rate was measured before and after taking a cup of coffee. The result is given as follows. Does coffee have any effect on the heart rate? (Perform the hypothesis testing with 95% CI.)

343

Testing of Hypothesis Cont…

Subject   PR before   PR after   Difference
1         68          74         +6
2         64          68         +4
3         52          60         +8
4         76          72         -4
5         78          76         -2
6         62          68         +6
7         66          72         +6
8         76          76          0
9         78          80         +2
10        60          64         +4
Mean      68          71         +3
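The paired t statistic for the table above can be computed with the standard library, as sketched here. The critical value ±2.262 (df = 9, α/2 = 0.025) is the table value used in the example; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Example 7.6: pulse rate before and after a cup of coffee
before = [68, 64, 52, 76, 78, 62, 66, 76, 78, 60]
after = [74, 68, 60, 72, 76, 68, 72, 76, 80, 64]
t_critical = 2.262  # table value for alpha/2 = 0.025 at df = 9

differences = [a - b for a, b in zip(after, before)]
d_bar = mean(differences)      # mean of the paired differences
sd = stdev(differences)        # sample SD (n - 1 in the denominator)
t = d_bar / (sd / sqrt(len(differences)))
reject_h0 = abs(t) > t_critical

print(f"mean difference = {d_bar}, SD = {sd:.2f}, t = {t:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The mean difference is 3 with an SD of about 3.92, giving t of about 2.42.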

• H0: Coffee intake has no effect on PR
• H1: Coffee intake has an effect on PR
• Test statistic: t test (paired)
• Critical value: ±2.262 (df = 9)
• First calculate the SD, then the test statistic:

$$SD = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}} = 3.92 \qquad t = \frac{3}{3.92/\sqrt{10}} = 2.4$$

• Reject the null hypothesis (at 95% confidence level).

• Coffee intake has an effect on PR.

Test of Hypothesis About Single Population Proportion

• The null hypothesis is that the population proportion is equal to some hypothesized value.

• One begins with a statement that claims a particular value for the unknown population proportion.

• The hypothesis testing for a single population proportion either accepts or rejects this statement.

• Here the Z test statistic is used. The formula is given as:

$$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}}$$

346

Test of Hypothesis on Means Using SPSS

• In SPSS the one sample T test, independent samples T test and paired samples T test are available under:

• Analyze > Compare Means > One sample T test or independent samples T test or paired samples T test

347


Example 7.7:
• A survey was conducted to determine the prevalence of protein energy malnutrition in a rural kebele. Of 300 under five children assessed, 123 were stunted. Can we conclude that the prevalence of PEM in the population is 50%?

• Step 1 and 2: Define the H0 and H1:

$$H_0: \pi = 0.5 \qquad H_1: \pi \ne 0.5$$

• Step 3: Appropriate test statistic:
 – Z statistic

• Step 4: Decide the level of significance and the corresponding critical value:
 – Let's take an α value of 0.1. Hence ±1.645 is the critical value.

$$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}} = \frac{0.41 - 0.5}{\sqrt{\frac{0.5(0.5)}{300}}} = \frac{-0.09}{\sqrt{\frac{0.25}{300}}} = -3.11$$

349

• Step 6: Make a decision and interpret it.
• At 90% confidence level we reject the null hypothesis that P = 0.5.
 – The calculated test statistic -3.11 is in the rejection region.
 – The corresponding one-tail P value for -3.11 (i.e. 0.0009) is less than the α/2 value of 0.05.
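A sketch of the single-proportion Z test in Example 7.7, using only the standard library. The variable names are illustrative; the two-sided P value is computed from the normal distribution instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist

# Example 7.7: 123 stunted children out of 300, H0: pi = 0.5, alpha = 0.10
events, n, pi0, alpha = 123, 300, 0.5, 0.10

p = events / n
z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)  # SE uses the hypothesized proportion
z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.645
p_two_sided = 2 * NormalDist().cdf(-abs(z))

reject_h0 = abs(z) > z_critical
print(f"p = {p:.2f}, Z = {z:.3f}, two-sided P = {p_two_sided:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The unrounded statistic is about -3.118 (the slide truncates it to -3.11), and the decision to reject H0 is unchanged.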

350

Testing of Hypothesis About Two Population Proportions

• The null hypothesis states that one population proportion is equal to another population proportion.

• The hypothesis test for two population proportions either rejects or fails to reject (accepts) this statement.

• Here Z test statistic is used. The formula is given as:

$$ Z = \frac{p_1 - p_2}{\sqrt{P (1 - P) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} $$

$$ P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} $$


Example 7.8:
• The prevalence of malaria in two malaria-endemic kebeles, X and Y, was compared. In kebele X, 15 of 120 samples were positive. In kebele Y, 20 of 100 samples were positive. Is there any significant difference between the prevalence of malaria in kebeles X and Y?


• Step 1 and 2: Define the H0 and H1:
 – H0: P1 = P2
 – H1: P1 ≠ P2

• Step 3: Decide the appropriate test statistic:
 – Z statistic

• Step 4: Decide the α value and the critical value:
 – Let's take an α value of 0.05. Hence ±1.96 is the critical value.

• Step 5: Obtain the value of the test statistic:
 – First calculate the proportions and the pooled proportion
 – P1 = 15/120 = 0.125, P2 = 20/100 = 0.2

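• Then we calculate the test statistic:

$$ P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{120(0.125) + 100(0.2)}{120 + 100} = \frac{15 + 20}{220} = 0.159 $$

$$ Z = \frac{0.125 - 0.2}{\sqrt{0.159 (1 - 0.159) \left( \frac{1}{120} + \frac{1}{100} \right)}} = \frac{-0.075}{\sqrt{(0.1337)(0.0183)}} = -1.51 $$

• Step 6: Make a decision and interpret it.

At the 95% confidence level we accept (fail to reject) the H0 that P1 = P2 because:
 – -1.51 is in the acceptance region.
 – The corresponding one-tail P value, 0.0655, is greater than the α/2 value of 0.025.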
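The pooled-proportion calculation of Example 7.8 can be reproduced with a short Python sketch. The helper name `two_proportion_z` is an assumption made for illustration only.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled proportion and Z statistic for H0: P1 = P2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return pooled, (p1 - p2) / standard_error

# Example 7.8: kebele X has 15 positives in 120, kebele Y has 20 in 100.
pooled, z = two_proportion_z(15, 120, 20, 100)
critical_value = 1.96  # two tailed, alpha = 0.05
print(f"pooled P = {pooled:.3f}, Z = {z:.2f}, reject H0: {abs(z) > critical_value}")
```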

Test of Hypothesis on Proportions Using SPSS

• There is no "point and click" option in SPSS to do such hypothesis testing on proportions.

• Syntax-based analysis can be done.

Test of Hypothesis about Categorical Data

• It is also possible to apply hypothesis testing to categorical data.
• The Chi-square (χ²) test statistic is commonly used.
• This test is usually applied to tabulated data.
• The table contains two variables, called the row and column variables.
• The test measures the discrepancy between K observed frequencies (O) and the corresponding K expected frequencies (e), i.e. for all cells of the tabulation.
• Expected frequencies are the frequencies that would occur if there were no association between the row and column variables.


• The H0 of the Chi-square test is that there is no association between the row and column variables.

• The H1 is that there is an association between the row and column variables.

• The closer the observed frequencies are to the expected frequencies, the more likely it is that H0 is true.

$$ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i} $$

$$ e = \frac{\text{row total for the cell} \times \text{column total for the cell}}{\text{grand total}} $$

• Assumptions of Chi-square test:
 – No cell of the table has an expected frequency less than 1,
 – No more than 20% of the expected frequencies should be less than 5.
• The Chi-square statistic should be compared with the chi-square distribution with df of (R-1)(C-1).
• Though only the upper tail of the Chi-square distribution is used as the rejection region, the test itself is always two tailed (the alternative hypothesis is non-directional).

Test of Hypothesis about Categorical Data

Example 7.9:
• A researcher is interested in assessing the effect of literacy on family planning use. Accordingly he collected data and tabulated the findings in the following manner. Can we say there is an association between educational status and family planning use?

FP use    Educational Status
          Illiterate   Literate   Total
Yes       63           49         112
No        15           33         48
Total     78           82         160

Test of Hypothesis about Categorical Data

• Step 1 and 2: Define the H0 and H1:
 – H0: There is no association between literacy and family planning use.
 – H1: There is an association between literacy and family planning use.

• Step 3: Decide the appropriate test statistic:
 – χ² test.

• Step 4: Decide α and the corresponding critical value:
 – Let's take an α value of 0.01.
 – At df of 1 the critical value is 6.635.
 – Acceptance area is 0-6.635; rejection area is χ² > 6.635.


Step 5: Obtain the value of the test statistic:
 – First the expected frequencies should be calculated:
 • Expected frequency for cell a: 78 x 112/160 = 54.6
 • Expected frequency for cell b: 82 x 112/160 = 57.4
 • Expected frequency for cell c: 78 x 48/160 = 23.4
 • Expected frequency for cell d: 82 x 48/160 = 24.6
 – The assumptions of the χ² test are fulfilled.
 – Then we calculate the Chi-square statistic.

$$ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i} = \frac{(63 - 54.6)^2}{54.6} + \frac{(49 - 57.4)^2}{57.4} + \frac{(15 - 23.4)^2}{23.4} + \frac{(33 - 24.6)^2}{24.6} $$

$$ \chi^2 = 1.29 + 1.23 + 3.02 + 2.87 = 8.41 $$

• Step 6: Make a decision and interpret it.
• At the 99% confidence level we accept the H1 that the two variables are associated, for the following reasons:
 – The calculated test statistic 8.41 is in the rejection area.
 – The corresponding P value of 8.41 (between 0.005 and 0.002) is less than the α value (0.01).

• But what is the direction of the association?
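The expected frequencies and the Chi-square statistic of Example 7.9 can be checked with the following Python sketch. The function names are assumptions for illustration; the P value shortcut shown is valid only for 1 degree of freedom, where the chi-square variable is the square of a standard normal variable.

```python
import math

def chi_square_statistic(table):
    """Return (chi-square, expected frequencies) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, observed in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            expected_row.append(e)
            chi2 += (observed - e) ** 2 / e
        expected.append(expected_row)
    return chi2, expected

def p_value_df1(chi2):
    """Upper-tail probability of the chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2))

# Example 7.9: rows are FP use (Yes, No); columns are (Illiterate, Literate).
chi2, expected = chi_square_statistic([[63, 49], [15, 33]])
p_value = p_value_df1(chi2)
print(f"chi-square = {chi2:.2f}, P = {p_value:.4f}, reject H0: {chi2 > 6.635}")
```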

Test of Hypothesis about Categorical Data Using SPSS

• In order to do a chi-square test using SPSS, follow these steps.

• Analyze > Descriptive Statistics > Crosstabs > Put the two categorical variables as column and row > Statistics > Check "Chi-square" > OK.

• The chi-square test is given in a table as "Pearson Chi-square".

Fisher's exact test

• Fisher's exact test is a statistical significance test used in the analysis of contingency tables when the sample size is small (i.e. when the assumptions of the chi-square test are not fulfilled).
• It is named after its inventor, R. A. Fisher.
• For hand calculations, the test is only feasible in the case of a 2 x 2 contingency table.
• Its application to higher order tables is controversial.
• H0: there is no association between the two variables
• H1: there is an association between the two variables
• The hypothesis is tested by comparing the probability of observing the given or more extreme tables, given that the null hypothesis is true, with the level of significance.


• The exact probability of observing a given table is given as:

$$ P = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{N!\,a!\,b!\,c!\,d!} $$

a        b        (a+b)
c        d        (c+d)
(a+c)    (b+d)    N

Hypothesis testing using Fisher's exact test involves the following steps:

1. Calculate the probability of the observed table itself,
2. List all possible more extreme tables manually (keeping the marginal totals unchanged),
3. Calculate their respective exact probabilities,
4. Calculate the probability of getting the observed or more extreme tables,
5. Multiply the total by 2 (to get the 2 tailed value),
6. Compare the result with the value of α.

Fisher's exact test

Example 7.10:
• In the following tabulated data, is there any association between the treatment type and the survival rate of patients? (Test the hypothesis at the 95% confidence level.)

Treatment type   Survived   Died   Total
A                7          2      9
B                5          6      11
Total            12         8      20

Fisher's exact test

• H0: No association between the treatment modalities and survival rate.

• H1: There is an association between the treatment modalities and survival rate.

• Test statistic: Fisher's exact test, because two of the expected frequencies have values less than 5.

• Level of significance: 5%
• Calculate the probability of getting the given or more extreme tables.

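Fisher's exact test

• Observed table:

Treatment type   Survived   Died   Total
A                7          2      9
B                5          6      11
Total            12         8      20

• Probability of observing this table = 9!11!12!8!/(20!7!2!5!6!) = 0.132

Fisher's exact test

• First possible extreme table:

Treatment type   Survived   Died   Total
A                8          1      9
B                4          7      11
Total            12         8      20

• Probability of observing this table = 9!11!12!8!/(20!8!1!4!7!) = 0.024

Fisher's exact test

• Second possible extreme table:

Treatment type   Survived   Died   Total
A                9          0      9
B                3          8      11
Total            12         8      20

• Probability of observing this table = 9!11!12!8!/(20!9!0!3!8!) = 0.001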

Fisher's exact test

• Probability of getting the observed or more extreme tables:
 – 0.132 + 0.024 + 0.001 = 0.157 (one tailed)
 – Two tailed: 2 x 0.157 = 0.314

• Conclusion and interpretation:
 – Accept the null hypothesis at the 95% confidence level
 – There is no association between the treatment modalities and survival rate.
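The hand calculation of Example 7.10 can be sketched in Python as follows. The function names are assumptions for illustration. The binomial-coefficient form used here is algebraically the same as the factorial formula given earlier, and the loop only walks towards larger values of cell a, which is the direction of the more extreme tables in this example.

```python
from math import comb

def table_probability(a, b, c, d):
    """Exact probability of a 2 x 2 table with fixed marginal totals."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def one_tailed_probability(a, b, c, d):
    """Sum the probabilities of the observed table and of the more extreme
    tables obtained by increasing cell a while the margins stay fixed."""
    total = 0.0
    while b >= 0 and c >= 0:
        total += table_probability(a, b, c, d)
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return total

# Example 7.10: treatment A (7 survived, 2 died), treatment B (5 survived, 6 died).
p_observed = table_probability(7, 2, 5, 6)
p_one_tailed = one_tailed_probability(7, 2, 5, 6)
p_two_tailed = 2 * p_one_tailed  # doubling rule used in step 5 of the procedure
print(f"{p_observed:.3f} {p_one_tailed:.3f} {p_two_tailed:.3f}")
```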

Fisher's exact test using SPSS

• In order to do Fisher's exact test using SPSS, follow these steps.

• Analyze > Descriptive Statistics > Crosstabs > Put the two categorical variables as column and row > Statistics > Check "Chi-square" > OK.

• Fisher's exact test is given in a table titled "Chi-square tests".

• NB: SPSS doesn't do Fisher's exact test for higher order tables.

Summary

• The interpretation of a hypothesis test depends on the confidence level at which the test is conducted.
• A hypothesis which is accepted at a lower level of confidence cannot be rejected at a higher level of confidence.
• A hypothesis which is rejected at a lower level of confidence can be accepted at a higher level of confidence.
• A hypothesis which is rejected at a higher level of confidence cannot be accepted at a lower level of confidence.
• A hypothesis which is accepted at a higher level of confidence can be rejected at a lower level of confidence.

Sample Size Calculation for Comparative Studies

• The concepts discussed in this chapter can be applied to the calculation of sample size for comparative studies.
• For comparative studies such as case control, cohort and interventional studies, the optimal size for the two groups is calculated using the formula:

$$ n_1 = \frac{\left[ Z_{\alpha} \sqrt{\left(1 + \frac{1}{r}\right) P (1 - P)} + Z_{\beta} \sqrt{P_1 (1 - P_1) + \frac{P_2 (1 - P_2)}{r}} \right]^2}{(P_1 - P_2)^2} $$

• Where

$$ P = \frac{P_1 + r P_2}{1 + r} $$

Sample Size Calculation Cont..

• Where:
 P is the pooled proportion
 P1 is the expected 1st proportion
 P2 is the expected 2nd proportion
 r is the number of controls per case
 Alpha is the probability of type I error
 Beta is the probability of type II error
 n1 is the sample size for the first group

NB: n2 is calculated by multiplying n1 by r.
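A minimal Python sketch of the sample size formula follows. The input proportions (0.2 and 0.1), the Z values (1.96 for a two tailed α of 0.05 and 0.84 for 80% power) and the function name are assumptions chosen for illustration; they do not come from the lecture.

```python
import math

def sample_size_two_groups(p1, p2, r=1.0, z_alpha=1.96, z_beta=0.84):
    """Return (n1, n2) for comparing two proportions with r controls per case."""
    pooled = (p1 + r * p2) / (1 + r)
    alpha_term = z_alpha * math.sqrt((1 + 1 / r) * pooled * (1 - pooled))
    beta_term = z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2) / r)
    n1 = math.ceil((alpha_term + beta_term) ** 2 / (p1 - p2) ** 2)
    n2 = math.ceil(r * n1)  # n2 is n1 multiplied by r
    return n1, n2

# Hypothetical inputs: expected proportions of 20% and 10%, one control per case.
n1, n2 = sample_size_two_groups(0.2, 0.1, r=1)
print(n1, n2)
```

Sample sizes are rounded up because a fractional study unit cannot be recruited.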

Correlation and Linear Regression

Regression and Correlation

• Many medical investigations are concerned with:
 – Establishment of a relationship between two variables.
 – The strength of a relationship.
 – Predicting one variable on the basis of another.
 – Controlling the effect of unwanted variables.

• Such intentions can be addressed by using either correlation or regression analysis.

Correlation Analysis

• Initially developed by Sir Francis Galton (1888) and Karl Pearson (1896).
• Correlation is the quantification of the degree to which two random quantitative variables are related, provided the relationship is linear.
• Both of the variables should be measured on the same set of study units.
• Strength of relationship measurement: Correlation Coefficient.
• Most commonly used coefficient: Product Moment Correlation or Pearson Correlation Coefficient (r).
• The symbol rho (ρ) is used to represent the population correlation coefficient.
• Unitless measure.

Correlation Analysis

• Does not imply a cause and effect relationship.
• The value of r ranges from -1 to +1.
• If the correlation coefficient is greater than 0, the variables are said to be positively correlated (i.e. as X increases, Y tends to increase).
• If the correlation coefficient is less than 0, the variables are said to be negatively correlated (i.e. as X increases, Y tends to decrease).
• If the correlation coefficient is 0 then the variables are said to be uncorrelated.

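Correlation Analysis Cont…

• The formula for computing the sample correlation coefficient (r) for two variables X and Y is given as:

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} $$

• Or

$$ r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}} $$

• Before computing r, a scatter plot of the two variables should be drawn. Why?

[Figure: scatter plots of y against x showing linear relationships and curvilinear relationships]

[Figure: scatter plots of y against x showing strong relationships and weak relationships]

[Figure: scatter plots of y against x showing no relationship]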


• Assumptions of correlation analysis:
 – Independent random samples are taken
 – Both variables are on interval/ratio scale
 – Linear association between X and Y
 – Paired measures for X and Y
 – Normal distribution for X and Y
 – Homogeneity of variance (Homoscedasticity)

• In situations where its assumptions are violated, correlation becomes inadequate to explain a given relationship.

Example 8.1:
• The data of a random sample of 20 countries are shown in the following table. X represents the percentage of children immunized by age one year and Y represents the under-five mortality rate. Determine the strength of association between the two variables.


Country    % Immunized (X)   CMR/1000LB (Y)   XY       Y²        X²
Bolivia    77                118              9086     13924     5929
Brazil     69                65               4485     4225      4761
Cambodia   32                184              5888     33856     1024
Canada     85                8                680      64        7225
China      94                43               4042     1849      8836
Czech      99                12               1188     144       9801
Egypt      89                55               4895     3025      7921
Ethiopia   13                208              2704     43264     169
Finland    95                7                665      49        9025
France     95                9                855      81        9025
Greece     54                9                486      81        2916
India      89                124              11036    15376     7921
Italy      95                10               950      100       9025
Japan      87                6                522      36        7569
Mexico     91                33               3003     1089      8281
Poland     98                16               1568     256       9604
Russia     73                32               2336     1024      5329
Senegal    47                145              6815     21025     2209
Turkey     76                87               6612     7569      5776
UK         90                9                810      81        8100
Total      1548              1180             68626    147118    130446


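$$ r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}} $$

$$ r = \frac{20(68626) - (1548 \times 1180)}{\sqrt{\left[20(130446) - (1548)^2\right]\left[20(147118) - (1180)^2\right]}} = -0.79 $$

• There is a strong negative linear relationship between the two variables.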
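The computational formula for r can be checked against the column totals of Example 8.1 with a short Python sketch. The function name is an assumption made for illustration.

```python
import math

def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson correlation coefficient from the totals of a worksheet table."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Totals taken from the table of Example 8.1 (20 countries).
r = pearson_r_from_sums(20, 1548, 1180, 68626, 130446, 147118)
print(f"r = {r:.2f}, 100% x r squared = {100 * r ** 2:.0f}%")
```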

• Interpretation option:
 – 100% × r²:
 • Shows the proportion of variation of one variable explained by the other.
 – Rule of thumb:

• Hypothesis Testing for a Correlation Coefficient
• As with means and proportions, it is also possible to test the significance of a population correlation.

• For a two tailed test
 – H0: ρ is 0
 – H1: ρ is different from 0

• The t test statistic is given as (with n-2 df):

$$ t = r \sqrt{\frac{n - 2}{1 - r^2}} $$


Example 8.2:
• At the 0.05 level of significance, can we claim that the correlation coefficient in Example 8.1 indicates a significant negative relationship between immunization coverage and child mortality?

Correlation Analysis Cont..

• The critical t value for the 0.05 level of significance (one tailed) at 18 degrees of freedom is -1.734. Then we calculate the test statistic.

$$ t = r \sqrt{\frac{n - 2}{1 - r^2}} = -0.79 \sqrt{\frac{20 - 2}{1 - (-0.79)^2}} = -0.79 \sqrt{\frac{18}{0.3759}} = -5.47 $$

• Since -5.47 is less than -1.734, we accept the H1 that r indicates a significant negative relationship between immunization coverage and child mortality.
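The t statistic for a correlation coefficient is a one-line calculation. The sketch below, with an assumed function name, reproduces the value for Example 8.2 and compares it with the one tailed critical value quoted in the text.

```python
import math

def correlation_t(r, n):
    """t statistic (n - 2 degrees of freedom) for testing a correlation coefficient."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Example 8.2: r = -0.79 from 20 countries, one tailed critical value -1.734.
t = correlation_t(-0.79, 20)
significant_negative = t < -1.734
print(f"t = {t:.2f}, significant negative relationship: {significant_negative}")
```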

Limitations:
• Applied only to a linear relationship.
• One must not extrapolate an observed correlation beyond the observed ranges of the x and y values.
• Does not differentiate dependent and independent variables.
• Confounding by a third variable.

Spearman's Rank Correlation
• It is a nonparametric (distribution-free) rank statistic proposed by Charles Spearman in 1904 as a measure of the strength of the association between two variables.
• Denoted as r_s
• Is applied when:
 • The normality assumption is not satisfied or cannot be tested,
 • At least one of the variables is given on an ordinal scale.
• In the calculation of the coefficient, the actual values of both variables should be changed into ranks.

Correlation Analysis Cont..

• The formula for the Spearman Correlation Coefficient is (given that there is no tied rank):

$$ r_s = 1 - \frac{6 \sum D^2}{n (n^2 - 1)} $$

• Where:
 – 6 is a constant,
 – D is the difference between a subject's ranks on the two variables,
 – n is the number of subjects.

• Consider the following example.

Correlation Analysis Cont..

The following table presents the MMR level and delivery service coverage in 10 developing countries.

Countries   MMR (per 100,000 LB)   MMR Rank   Delivery Service Coverage (%)   Rank   D    D²
1           315                    4          55                              6      -2   4
2           450                    6          40                              5      1    1
3           200                    1          70                              8      -7   49
4           250                    3          79                              10     -7   49
5           243                    2          75                              9      -7   49
6           830                    9          25                              3      6    36
7           850                    10         20                              2      8    64
8           656                    7          20                              1      6    36
9           701                    8          30                              4      4    16
10          410                    5          60                              7      -2   4
Total                                                                                     308

$$ r_s = 1 - \frac{6 \sum D^2}{n (n^2 - 1)} = 1 - \frac{6 \times 308}{10 (100 - 1)} = 1 - \frac{1848}{990} = 1 - 1.87 = -0.87 $$
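The rank-difference formula can be reproduced from the two rank columns of the table. This is a small Python sketch with an assumed function name; it takes the ranks exactly as tabulated and does not apply any correction for ties.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Return (r_s, sum of squared rank differences) using the no-ties formula."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2

# Rank columns from the table of 10 countries.
mmr_rank = [4, 6, 1, 3, 2, 9, 10, 7, 8, 5]
coverage_rank = [6, 5, 8, 10, 9, 3, 2, 1, 4, 7]
r_s, sum_d2 = spearman_from_ranks(mmr_rank, coverage_rank)
print(f"sum of D squared = {sum_d2}, r_s = {r_s:.2f}")
```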

Correlation Analysis Cont..

• Inference about r_s

• For hypothesis testing, a t score can be calculated (at df of n-2) using the formula:

$$ t = \frac{r_s}{\sqrt{\frac{1 - r_s^2}{n - 2}}} $$

• For the previous example the t score would be:

$$ t = \frac{-0.87}{\sqrt{\frac{1 - (-0.87)^2}{10 - 2}}} = -5 $$

• If the hypothesis test is a two tailed test at the 0.05 level of significance, we reject the H0 as |t| = 5 > 2.306.

Partial Correlation

• A method used to describe the relationship between two variables while taking away the effects of another variable, or several other variables, on this relationship.

• Still requires meeting all the usual assumptions of Pearsonian correlation.

• But the covariate need not necessarily be numeric.

Correlation Analysis Using SPSS

• In order to do correlation analysis using SPSS, follow these steps:
• Analyze > Correlate > Bivariate > Put the two variables in the variable box > Select Pearson or Spearman (another option is also there) > OK.

• Partial correlation can also be done.

• Analyze > Correlate > Partial.
• But before that, don't forget the scatter plot.

Regression Analysis

• In correlation analysis the interest is to show how two numeric variables are related.

• However, in regression analysis we are interested in explaining or modeling a dependent variable (Y) as a function of one or more independent variables (X).

• Regression analysis is used to:
 – Assess association between two variables.
 – Predict/explain the value of a dependent variable based on the value of at least one independent variable (i.e. mathematical modeling).
 – Control for confounding factors.
 – Show the possible effect of interaction among variables.

Regression Analysis Cont..

• The general regression equation is given as:

$$ Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n $$

Where: Y is the value of the dependent variable,
 X is the independent variable,
 α is the intercept,
 β is the coefficient of the independent variable

• If the equation has only one independent variable, the regression is called Simple Regression.
• If multiple independent variables are involved, it is called Multiple Regression.
• In public health the most commonly used types of regression analysis are Linear and Logistic Regression.

Linear Regression

• Also known as linear least squares regression.
• It is by far the most widely used modeling method.
• The dependent variable is assumed to be a linear function of one or more independent variables plus an error term introduced to account for all other factors.

$$ Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon $$

• Where Y is the dependent variable, the Xs are the independent variables and ε is the random error term.
• The DV (Y) is given on a continuous numeric scale while the IV/s (X) can be of any type (mostly numeric).

Linear Regression Cont..

• The equation provides the value the DV would have for given value/s of the IV/s.
• For example, if we develop a linear model with the DV of body height and the IV of serum growth hormone, we can predict the height of a person with a given value of serum GH.
• Can be simple or multiple regression.
• It attempts to model the relationship between the dependent and independent variables by fitting a linear equation to observed data.

Linear Regression Cont..

• A scatter plot helps to assess the presence of a linear trend of association.
• Consider the following data showing the number of households in China with TV.

Linear Regression Cont..

• If we plot these data, we get the following graph.

[Figure: scatter plot of households with TV against year]

• Although no straight line passes exactly through these points, there are many straight lines that pass close to them. Here is one of them.

[Figure: the same scatter plot with a straight line drawn close to the points]

Linear Regression Cont..

• How would you draw a line through the points? How do you determine which line 'fits best'?
• The most common method for fitting a regression line is the method of least squares.
• This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line.
• "Best fit" means the differences between the actual Y values and the predicted Y values are at a minimum.
• Hence, linear regression is a method of finding the linear equation that comes closest to fitting a collection of data points.


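[Figure: scatter plot of Y against X with a fitted line and the vertical deviations ε̂1, ε̂2, ε̂3 and ε̂4 of four points from the line]

$$ Y_2 = \hat{\beta}_0 + \hat{\beta}_1 X_2 + \hat{\varepsilon}_2 $$

$$ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i $$

LS minimizes

$$ \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2 $$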

• Suppose that we used the line rather than the data points to estimate the number of households with TV.
• Then we would get slightly different values from the original observed values shown above. These values are called predicted values.

Year (X)                  Households with TV (millions)   Households with TV (millions)   Residual
(0 represents 2000)       Observed Values                 Predicted Values
0                         68                              62                              6
1                         72                              70                              2
2                         80                              78                              2
3                         83                              86                              -3

• The better our choice of line, the closer the predicted values will be to the observed values.
• The difference between the observed value and the predicted value is called the residual.
• Residual = Observed Value - Predicted Value
• The best line is the line with the smallest sum of squares of error (SSE) (i.e. least squares estimation).
• SSE = Sum of squares of residuals = Sum of (y_observed – y_predicted)²
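The residuals and SSE for the households-with-TV table can be computed directly, and the same data can be used to see what "smallest SSE" means. The Python sketch below (function names are assumptions) compares the line used in the table with the line obtained from the standard least squares formulas.

```python
def sum_of_squared_errors(observed, predicted):
    """SSE: sum of squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))

def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

years = [0, 1, 2, 3]
observed = [68, 72, 80, 83]
table_predicted = [62, 70, 78, 86]  # predicted values listed in the table

residuals = [o - p for o, p in zip(observed, table_predicted)]
sse_table = sum_of_squared_errors(observed, table_predicted)

intercept, slope = least_squares_line(years, observed)
ls_predicted = [intercept + slope * x for x in years]
sse_ls = sum_of_squared_errors(observed, ls_predicted)
print(residuals, sse_table, round(sse_ls, 2))
```

The line in the table is one of many lines that pass close to the points; the least squares line gives a smaller SSE for the same data.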

Linear Regression Cont..

• The manual calculation of the coefficients of linear regression is possible when we have one independent variable, i.e. Y = α + βX.
• As in correlation analysis, here we should have a set of paired DV and IV values for all study units.
• The line which represents the dataset (Y = α + βX) is calculated using the formulas:

$$ \beta = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}} \qquad \alpha = \bar{y} - \beta \bar{x} $$


• Consider the following data.

• First we should plot a scatter diagram.

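[Figure: scatter diagram of Y (0 to 4) against X (0 to 6)]

Linear Regression Cont…

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}}{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}} = \frac{37 - \frac{(15)(10)}{5}}{55 - \frac{(15)^2}{5}} = 0.70 $$

$$ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 2 - (0.70)(3) = -0.10 $$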
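The worked calculation above uses only the column totals (n = 5, ΣX = 15, ΣY = 10, ΣXY = 37, ΣX² = 55), so it can be verified with a few lines of Python. The function name is an assumption for illustration.

```python
def regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (intercept, slope) of the simple linear regression line."""
    slope = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

# Totals used in the worked calculation.
intercept, slope = regression_from_sums(5, 15, 10, 37, 55)
print(f"Y = {intercept:.2f} + {slope:.2f} X")
```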

Linear Regression Cont…

• One of the indices used to measure model goodness of fit for simple linear regression is R-squared, or the coefficient of determination.
• It is the proportion of variation explained by the best line model.
• It depends on the ratio of the sum of squared errors from the regression model (SSE) to the sum of squared differences around the mean (SST = sum of squares total).

$$ R^2 = 1 - \frac{SSE}{SST} $$

• Where:

$$ SSE = \sum (y_i - \hat{y}_i)^2, \qquad SST = \sum (y_i - \bar{y})^2 $$

Linear Regression Cont…

• For multiple linear regression, adjusted R-squared is used.
• As a general rule of thumb, the R-squared or adjusted R-squared should be higher than 0.80 to produce a good linear model.
• If your R-squared is less than 0.5, it is recommended that you consider another type of model rather than a linear model.
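R-squared follows directly from SSE and SST. The sketch below uses the observed households-with-TV values from the earlier table; the fitted values are an assumption, computed here from the line Y = 67.8 + 5.3X for illustration, and the function name is hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_y) ** 2 for o in observed)
    return 1 - sse / sst

observed = [68, 72, 80, 83]
# Assumed fitted values from the line Y = 67.8 + 5.3 X for X = 0, 1, 2, 3.
fitted = [67.8 + 5.3 * x for x in range(4)]
r2 = r_squared(observed, fitted)
print(f"R-squared = {r2:.3f}")
```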

Interpretation of linear regression coefficient:
• Let's consider the following simple linear regression equation:
• Y = α + βX
• β represents the slope, and α represents the y-intercept.
• The slope represents the estimated average change in Y when X increases by one unit.
• The intercept represents the estimated average value of Y when X equals zero. (Practically less important.)
• When we represent a binary independent variable (coded as 0-1), the slope represents the estimated average change in Y when you switch from 0 to 1.

Example 8.3:
• Assume that the duration of breast feeding in weeks (Y) was found to be positively correlated with maternal age in years (X). A linear regression model was developed to explain the association. The equation is given as Y = 5.92 + 0.389X. How would you interpret the equation?

Linear Regression Cont…

Assumptions:
• Normal distribution: Regression assumes that variables have normal distributions.
• Homoscedasticity: The variance of the error terms is constant for each value of x.
• Linearity: The relationship between each x and y is linear.
• Normally distributed error terms: The error terms follow the normal distribution.
• Independence of error terms: Successive residuals are not correlated.
• No multicollinearity: The independent variables are not correlated with each other.

Linear Regression Cont…

Hypothesis testing in linear regression:
• Questions to be answered through the hypothesis testing are:
 – Does the entire set of independent variables contribute significantly to the prediction of y?
 – Does the addition of one particular variable of interest add significantly to the prediction of y achieved by the other independent variables already in the model?
• The null and alternative hypotheses are given as:
 – H0: β1 = β2 = · · · = βp = 0
 – H1: βj ≠ 0 for at least one j.

• The F test and t test are used to test the hypothesis.
• F is a test for statistical significance of the regression equation as a whole. It is obtained by dividing the explained variance by the unexplained variance (given in the ANOVA table).

• The t test is used to see whether a specific variable is significant in explaining the dependent variable or not.

Linear Regression Using SPSS

• Analyze > Regression > Linear > Put the dependent and independent variables > Select appropriate statistics > OK.

Logistic Regression

Introduction
• Logistic Regression is a model used for predicting the probability of occurrence of a categorical event by fitting data to a Logistic Curve.

• Common dichotomous dependent variables include disease status (healthy or ill), clinical outcome (alive or dead), treatment outcome (success or failure), utilization of health commodities (utilization or non-utilization), etc.
• Application:
 – Modeling for risk prediction, identification of determinants and health programming,
 – Controlling confounding and interacting factors.

Introduction Cont…

• Comparative advantages of Logistic Regression:
 – Fewer assumptions,
 – Mathematically amenable,
 – Easier interpretation.

• Classification of Logistic Regression (LR):
 – Binomial LR: Dependent variable is dichotomous.
 – Multinomial LR: Dependent variable with more than two classes.
 – Ordinal LR: Dependent variable with multiple and ranked classes.

Logistic Regression Function

• Binary dependent variables are coded as 0 or 1.
• The probability of the event (Y = 1) is equal to the proportion of 1s in the distribution (P).

• The logistic function associates the Independent Variable (IV) X with the probability of occurrence of the Dependent Variable (DV) Y.

• The function is given as:

$$ P = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} $$

LR Function Cont…

• The function is represented by an S-shaped "Sigmoid graph", which is called the Logistic Curve.
• Examples:

LR Function Cont…

• The derivation of the function can be demonstrated with an example.
• Suppose we want to predict a person's sex based on the person's height.
• Let's say the probability of being male at a given height is 0.9.
• Odds (P/(1-P)) of being male = 0.9/0.1 = 9
• Odds of being female = 0.1/0.9 = 0.11
• However, the values look asymmetrical.
• This can be corrected by the application of ln.
• ln(9) = 2.197 and ln(1/9) = -2.197
• The overall transformation is the Logit Transformation.
• The log of odds is abbreviated as the Logit.
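The symmetry produced by the logit transformation is easy to verify numerically. The following Python sketch (helper names are assumptions) computes the odds and logits for the height example.

```python
import math

def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))

p_male = 0.9
print(round(odds(p_male), 2), round(odds(1 - p_male), 2))    # asymmetrical odds
print(round(logit(p_male), 3), round(logit(1 - p_male), 3))  # symmetrical logits
```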

LR Function Cont…

Mathematically:

$$ \ln\left(\frac{p}{1 - p}\right) = \alpha + \beta x $$

$$ \frac{p}{1 - p} = e^{\alpha + \beta x} $$

$$ P = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} $$

$$ P = \frac{1}{1 + e^{-z}} \quad \text{where } z = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n $$

LR Function Cont…

• One of the advantages of Logistic Regression: it is possible to compute the OR from its coefficient.

• Let's assume a researcher is interested in studying the effect of smoking as the predicting variable (X) on the dependent variable lung cancer (Y).
 – X can be present (X=1) or absent (X=0),
 – Y can be present (Y=1) or absent (Y=0),

$$ \log\left[\frac{P(Y = 1)}{1 - P(Y = 1)}\right] = \alpha + \beta X $$

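LR Function Cont…

• Hence:

$$ \log\left[\text{odds}(Y = 1 \mid X = 1)\right] = \alpha + \beta(1) $$

$$ \log\left[\text{odds}(Y = 1 \mid X = 0)\right] = \alpha + \beta(0) $$

• The OR = Odds of smokers ÷ Odds of non-smokers

$$ OR = \frac{e^{\alpha + \beta(1)}}{e^{\alpha}} $$

$$ OR = e^{\beta} $$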
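The result OR = e^β can be checked numerically by computing the predicted odds for X = 1 and X = 0. In this Python sketch the coefficient values are hypothetical and the function names are assumptions made for illustration.

```python
import math

def predicted_probability(alpha, beta, x):
    """Logistic function: P(Y = 1) for a given value of x."""
    z = alpha + beta * x
    return 1 / (1 + math.exp(-z))

def odds(p):
    return p / (1 - p)

# Hypothetical coefficients, not estimated from any real data.
alpha, beta = -2.0, 0.9

odds_exposed = odds(predicted_probability(alpha, beta, 1))
odds_unexposed = odds(predicted_probability(alpha, beta, 0))
odds_ratio = odds_exposed / odds_unexposed
print(f"OR from odds = {odds_ratio:.3f}, exp(beta) = {math.exp(beta):.3f}")
```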

Assumptions of Logistic Regression

• Logistic Regression has fewer assumptions than Linear Regression:
 – The DV need not be normally distributed.
 – Normally distributed error terms are not assumed.
 – Error terms need not be homoscedastic for each level of the IVs.

Assumptions of LR Cont…

But it has the following assumptions:

1. Data type: A dichotomous or polytomous DV.

2. Inclusion of all relevant variables and exclusion of the irrelevant ones: i.e. based on a scientific framework or a statistical cutoff point (P=0.3).

3. No interaction: LR doesn't consider interaction effects except when interactions are created as a variable.

4. No outliers and influential cases: Such cases can affect the model significantly.

Assumptions of LR Cont…

5. No multicollinearity: As the IVs increase in correlation with each other, the standard errors become inflated. It can be detected by:
 – A standard error > 2.0.
 – Examining the correlations and associations b/n IVs
 – Tolerance and VIF.

6. No outliers and influential cases: Such cases can affect the model significantly.

7. Large samples:
 – The minimum ratio of valid cases to variables should be at least 10:1. The preferred ratio is 20:1.

Assumptions of LR Cont…

8. Linearity:
 – Linear relationship b/n numeric IVs and the logit of the DV.
 – If not, the model underestimates the association and lacks power.
 – Box-Tidwell Test: If there is non-linearity for a numeric IV X, the [(X)*ln(X)] interaction term becomes significant in the model.

Fitting Logistic Model to a Dataset

• In Linear Regression, the fit of the model to the dataset is achieved through Least Squares Estimation (LSE).

• In Logistic Regression LSE can't be used.

• In its place Maximum Likelihood Estimation (MLE) is used.

• MLE relies on the concept of Likelihood.

• The likelihood of a set of data is the probability of obtaining that particular set of data, using a given model.

Fitting Logistic Model Cont…

For example:
• Dataset B has five cases. Observed values for Y are (1,0,1,0,1).
• The model predicts that the probability of occurrence of Y is 0.7 (i.e. the probability of Y=1 is 0.7, and of Y=0 is 0.3).
• The likelihood of B is the joint probability of predicting the correct observed value of Y for every case using the model.
• i.e. L(B) = (0.7)(0.3)(0.7)(0.3)(0.7) = 0.03087

$$ L(B) = \prod_{i=1}^{n} P_i^{y_i} (1 - P_i)^{1 - y_i} $$

Fitting Logistic Model Cont…

• Mathematically it is easier to work with the log likelihood.

$$ \ln L(B) = \sum_{i=1}^{n} \left[ y_i \ln(P_i) + (1 - y_i) \ln(1 - P_i) \right] $$

• Maximum Likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them.

• The MLE of the parameter P is that value of P that maximizes L or ln L.
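The likelihood of dataset B and the idea of picking the value of P that maximizes it can be illustrated with a small Python sketch. The function names and the simple grid search are assumptions for illustration; statistical software uses iterative algorithms rather than a grid.

```python
import math

def likelihood(y, p):
    """Joint probability of the observed 0/1 values when P(Y = 1) = p."""
    return math.prod(p if yi == 1 else 1 - p for yi in y)

def log_likelihood(y, p):
    """Log of the likelihood, written as a sum over cases."""
    return sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in y)

y = [1, 0, 1, 0, 1]  # dataset B

print(round(likelihood(y, 0.7), 5))

# Try candidate values of P from 0.01 to 0.99 and keep the one with the
# largest log likelihood.
candidates = [i / 100 for i in range(1, 100)]
best_p = max(candidates, key=lambda p: log_likelihood(y, p))
print(best_p, round(likelihood(y, best_p), 5))
```

For a model with a single constant probability, the maximizing value is the observed proportion of 1s (3 out of 5), which gives a higher likelihood than P = 0.7.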

Fitting Logistic Model Cont…

• Iteration: Repeated testing of the data and tuning of the model parameters to provide the best fitting equation.
• Once P is determined, α and β are estimated.

Interpretation of Reg. Coefficients

• α is called the Intercept and β1, β2, and so on, are called the Regression Coefficients of x1, x2, …, respectively.

• α is the value of z when the value of all risk factors is zero.

• A +ve coefficient means the risk factor increases the probability of the outcome, while a -ve coefficient means the opposite.
• A large coefficient means that the risk factor strongly influences the probability of the outcome, while a near-zero coefficient means the opposite.

$$ P = \frac{1}{1 + e^{-z}} \quad \text{where } z = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n $$

Hypothesis Testing in Logistic Reg.

• In Logistic Regression the t or F test statistic cannot be used for hypothesis testing since the DV has a Bernoulli distribution.
• Options:
 – The (log) Likelihood Ratio Statistic (-2LL),
 – The Wald Test.
• All test either of the following null hypotheses:
 – Ho: β1 = β2 = β3 = … = βn = 0
 – Ho: Removing an IV from the model doesn't change its predictive ability.

Hypothesis Testing LR Cont….

A. Likelihood Ratio Test Statistic (-2LL):
• Usually two nested models (the Full and Reduced Models) are presented.

• A reduced model means a model from which a variable is purposely omitted.

• Ho: The removed variable is not significant in the model.

• -2 Log L = -2 [Log L Reduced model – Log L Full model]

$$ \text{LR statistic} = -2 \log\left(\frac{L \text{ of the reduced model}}{L \text{ of the full model}}\right) $$

Hypothesis Testing LR Cont….
• If the full model explains the data "much better" than the reduced model, the difference will be "large":
Reject the Ho that the removed variable is non-significant.
• If the reduced model explains the data as well as the full model, the difference will be close to 0:
Do not reject the Ho that the removed variable is non-significant.
• LRT ~ χ² with df = number of removed variables.
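As a rough illustration of the likelihood ratio test, the sketch below compares two nested models that differ by one variable. The two log-likelihood values are hypothetical, not taken from any dataset in these notes, and the p-value formula used is the closed form that holds only for a chi-square distribution with 1 df.

```python
import math

# Hypothetical log likelihoods of two nested models (assumed values).
lnL_reduced = -120.5   # model with one variable removed
lnL_full = -117.2      # model with all variables
df = 1                 # number of removed variables

# LR statistic = -2 [ln L(reduced) - ln L(full)]
lr_stat = -2 * (lnL_reduced - lnL_full)

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 df: erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2))

p_value = chi2_sf_df1(lr_stat)
reject_h0 = p_value < 0.05

print(round(lr_stat, 2), round(p_value, 4), reject_h0)
```

With these assumed values the statistic exceeds 3.84, the 5% critical value of chi-square with 1 df, so the removed variable would be judged significant.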

Hypothesis Testing LR Cont….
B. Wald Statistic:
• Commonly used to test the significance of the coefficient of each independent variable.
• Ho: A particular coefficient is zero.

$$\text{Wald test} = \frac{\beta^2}{\text{Variance of } \beta}$$

• W ~ χ² with df of 1.
• For a particular IV, if W is significant, then the parameter associated with this variable is not zero, so the variable should be included in the model.
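A minimal numeric sketch of the Wald statistic follows. The coefficient and its standard error are hypothetical values chosen for illustration; the p-value again uses the closed form for chi-square with 1 df.

```python
import math

# Hypothetical estimate and standard error of one coefficient (assumed values).
beta = 0.9
se_beta = 0.3

# Wald statistic = beta^2 / Var(beta), where Var(beta) = se_beta^2.
wald = beta ** 2 / se_beta ** 2

# Upper-tail probability of chi-square with 1 df.
p_value = math.erfc(math.sqrt(wald / 2))

print(round(wald, 2), round(p_value, 4))
```

Here W is 9, which corresponds to a coefficient three standard errors away from zero, so Ho would be rejected at the 0.05 level.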

Pseudo R-Squares

• In Linear Regression, R² measures the proportion of variance of the DV explained by the predictors.
• Ranges from 0 to 1.
• Logistic Regression doesn't have an equivalent to the R².
• However, there are varieties of Pseudo R² which are designed to simulate the real R².
• Commonly used: Cox & Snell R² and Nagelkerke R².
• Pseudo R² doesn't mean exactly what R² means in Linear Regression: interpretation should be made with caution.

Pseudo R-Squares Cont….

A. Cox and Snell's Pseudo R²

$$R^2 = 1 - \left[ \frac{L(M_{\text{Intercept}})}{L(M_{\text{Full}})} \right]^{2/N}$$

B. Nagelkerke Pseudo R²

$$R^2 = \frac{1 - \left[ \dfrac{L(M_{\text{Intercept}})}{L(M_{\text{Full}})} \right]^{2/N}}{1 - L(M_{\text{Intercept}})^{2/N}}$$
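Both pseudo R² formulas can be evaluated from the log likelihoods that software reports. The sketch below uses hypothetical log-likelihood values and sample size, and works on the log scale because the raw likelihoods are usually too small to handle directly.

```python
import math

# Hypothetical values (assumptions for illustration only).
lnL_intercept = -68.0   # log likelihood of the intercept-only model
lnL_full = -55.0        # log likelihood of the full model
N = 100                 # sample size

# Cox and Snell: 1 - [L(intercept) / L(full)]^(2/N)
cox_snell = 1 - math.exp((2 / N) * (lnL_intercept - lnL_full))

# Nagelkerke: Cox and Snell divided by its maximum, 1 - L(intercept)^(2/N)
max_cox_snell = 1 - math.exp((2 / N) * lnL_intercept)
nagelkerke = cox_snell / max_cox_snell

print(round(cox_snell, 3), round(nagelkerke, 3))
```

The division rescales the Cox and Snell value so that it can reach 1, which is why the Nagelkerke value is always at least as large.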

Goodness of Fit Analysis
A. Hosmer-Lemeshow Statistic
• The recommended test for the overall fit of a Logistic Regression model.
• A type of chi-square test, but considered stronger than the traditional chi-square test, particularly if continuous covariates are in the model or the sample size is small.
• The HL statistic first sorts observations in increasing order of their estimated event probability and divides them into deciles based on the predicted probabilities.
• HL statistic ~ χ² with df of 8.

Goodness of Fit Analysis Cont…

$$G^2_{HL} = \sum_{j=1}^{10} \frac{(O_j - E_j)^2}{E_j \left(1 - E_j / n_j\right)} \approx \chi^2 \text{ with 8 df}$$

• Where
– n_j is the number of observations in the j-th group
– O_j is the observed number of cases in the j-th group
– E_j is the expected number of cases in the j-th group
• Non-significance means the model adequately fits the data.
• A P value of 0.05 is considered as the level of significance.
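The HL formula can be sketched in a few lines once the observations have been grouped into deciles. The group sizes and the observed and expected counts below are made up for illustration, and the p-value helper relies on a closed form that is valid only for even degrees of freedom (here 8).

```python
import math

def hosmer_lemeshow(n, observed, expected):
    """Sum over groups of (O_j - E_j)^2 / (E_j * (1 - E_j / n_j))."""
    return sum((o - e) ** 2 / (e * (1 - e / nj))
               for nj, o, e in zip(n, observed, expected))

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form holds only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# Hypothetical deciles: 20 observations per group (assumed data).
n = [20] * 10
observed = [1, 3, 3, 6, 6, 10, 11, 15, 16, 19]
expected = [1.0, 2.0, 3.5, 5.0, 7.0, 9.0, 11.5, 14.0, 16.5, 18.5]

g_hl = hosmer_lemeshow(n, observed, expected)
p_value = chi2_sf_even_df(g_hl, df=8)
adequate_fit = p_value > 0.05   # non-significance means adequate fit

print(round(g_hl, 3), round(p_value, 3), adequate_fit)
```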

Goodness of Fit Analysis Cont…

B. Loglikelihood Statistics

• A good model is one that results in a high likelihood of the observed results.
• This translates into a small value for -2LL.
• If a model fits perfectly, the -2LL would be 0.
• Since there is no acceptable upper cutoff point for the -2LL test, it is difficult to interpret the meaning of the score.
• Less commonly used.

Logistic Regression Using SPSS

• Analyze > Regression > Binary Logistic > Put the dependent and independent variables > Mark categorical independent variables > Check for the options > Ok.

Or

• Analyze > Regression > Multinomial Logistic > Put the dependent variable > Put the independent variables as factors or covariates depending on their nature > Check for available options > Ok.

Analysis of Variance (ANOVA)

ANOVA
• Used to compare the mean of a quantitative variable across different categories of a categorical variable.
• The specific type is called One-way ANOVA.
• If two categorical variables (factors) are involved it is called Two-way ANOVA.
• If the categorical variable has only 2 values, a 2-sample t-test can be used.
• ANOVA allows for comparison among 3 or more groups.
• ANOVA is helpful because it possesses a certain advantage over a two-sample t-test.
• Doing multiple two-sample t-tests would result in a largely increased chance of committing a type I error.
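The inflation of the type I error can be approximated with a short calculation. The sketch assumes each pairwise t-test is run at the 0.05 level and treats the tests as independent, which is a simplification (pairwise tests that share a group are correlated), so the result is only an approximate guide.

```python
from math import comb

alpha = 0.05   # significance level of each individual t-test
k = 3          # number of groups being compared

# Number of pairwise two-sample t-tests needed to compare every pair of groups.
n_tests = comb(k, 2)

# Chance of at least one false positive across all tests, assuming independence.
familywise_error = 1 - (1 - alpha) ** n_tests

print(n_tests, round(familywise_error, 4))
```

With three groups the chance of at least one type I error is roughly 14% instead of the nominal 5%, and it grows quickly as more groups are added.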

ANOVA Cont…
• ANOVA functions by checking whether the differences between the groups are significant, which depends on:
– The difference in the means
– The standard deviations of each group
– The sample sizes
• ANOVA determines the P-value from the F statistic.
• Hypothesis:
– H0: The means of all the groups are equal.
– H1: Not all the means are equal.
• Doesn't explain which ones differ.
• Once a global difference is detected, it should be followed up with "multiple comparisons" (post hoc tests) to identify specific differences.

ANOVA Cont…

Assumptions of ANOVA:
• Each group is approximately normally distributed,
• Observed data constitute independent random samples from the respective populations,
• Standard deviations of each group are approximately equal
– Rule of thumb: ratio of largest to smallest sample standard deviation must be less than 2:1
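The 2:1 rule of thumb is easy to check before running ANOVA. As an illustration, the sketch below applies it to the three groups used in the worked example later in these notes; the variable names are arbitrary.

```python
from statistics import stdev

# Groups from the worked ANOVA example in these notes.
groups = {
    "Group 1": [5.3, 6.0, 6.7],
    "Group 2": [5.5, 6.2, 6.4, 5.7],
    "Group 3": [7.5, 7.2, 7.9],
}

# Sample standard deviation (n - 1 in the denominator) of each group.
sds = {name: stdev(values) for name, values in groups.items()}

# Rule of thumb: largest SD divided by smallest SD should be below 2.
sd_ratio = max(sds.values()) / min(sds.values())
equal_sd_ok = sd_ratio < 2

print({name: round(sd, 3) for name, sd in sds.items()}, round(sd_ratio, 3), equal_sd_ok)
```

For these data the ratio falls just under 2, so the rule of thumb is satisfied, though only narrowly.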

ANOVA Cont…
• ANOVA is a technique whereby the total variation present in a dataset is segregated into several components.
• Variation is the sum of the squares of the deviations between each value and the mean of the values.
• Sum of squares (SS) is another name for variation.
• ANOVA measures two sources of variation in the data and compares their relative sizes.
– Between group variation
– Within group variation

ANOVA Cont…
Between group variation:
• Is there some variation between the groups?
• Sometimes called the variation due to the factor.
• Denoted SS(B) for Sum of Squares (variation) between the groups.
• Calculated as follows (given x double bar is the grand mean, and n_i and x̄_i are the size and mean of the i-th of k groups):

$$SS(B) = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{\bar{x}})^2$$

$$SS(B) = n_1(\bar{x}_1 - \bar{\bar{x}})^2 + n_2(\bar{x}_2 - \bar{\bar{x}})^2 + \dots + n_k(\bar{x}_k - \bar{\bar{x}})^2$$

ANOVA Cont…

Within group variation:
• Is there some variation within the groups?
• Sometimes called the error variation, as it is the variation that can't be explained by the factor.
• Denoted SS(W) for Sum of Squares (variation) within the groups.
• Calculated as follows, given n_i is the sample size and s_i the standard deviation of the i-th group:

$$SS(W) = \sum_{i=1}^{k} (n_i - 1) s_i^2$$

$$SS(W) = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \dots + (n_k - 1)s_k^2$$

ANOVA Cont…
Variance:
• Based on the variation (SS), variance is calculated for both categories.
• The variance is also called the Mean of the Squares and abbreviated by MS, often with an accompanying variable: MS(B) or MS(W).
• Calculated by dividing the variation by the df:
• MS = SS / df
• The between group df is one less than the number of groups (k-1).
• The within group df is the sum of the individual dfs of each group; in other words it is (n-k), where n is the total number of observations.
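The sums of squares and mean squares can be computed directly from these definitions. The sketch below uses the three groups from the worked example in these notes; the variable names are illustrative, and `statistics.variance` returns the sample variance s² with n - 1 in the denominator.

```python
from statistics import mean, variance

# Groups from the worked ANOVA example in these notes.
groups = [
    [5.3, 6.0, 6.7],
    [5.5, 6.2, 6.4, 5.7],
    [7.5, 7.2, 7.9],
]

n = sum(len(g) for g in groups)   # total number of observations
k = len(groups)                   # number of groups
grand_mean = sum(sum(g) for g in groups) / n

# SS(B): sum of n_i * (group mean - grand mean)^2
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# SS(W): sum of (n_i - 1) * s_i^2
ss_within = sum((len(g) - 1) * variance(g) for g in groups)

# MS = SS / df
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n - k)

print(round(ss_between, 6), round(ss_within, 6),
      round(ms_between, 6), round(ms_within, 6))
```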

ANOVA Cont…
The F distribution:
• Used as the test of significance in ANOVA.
• The F distribution is defined as the distribution of (Z/n1)/(W/n2), where Z has a chi-square distribution with n1 df, W has a chi-square distribution with n2 df, and Z and W are statistically independent.
• In ANOVA, the F test statistic is the ratio of two sample variances (MSB/MSW).
• The df for the numerator are the df for the between group (k-1) and the df for the denominator are the df for the within group (n-k).
• A large F is evidence against H0, since it indicates that there is more difference between groups than within groups.

ANOVA Cont…
Example:
• Suppose we have three groups:
– Group 1: 5.3, 6.0, 6.7
– Group 2: 5.5, 6.2, 6.4, 5.7
– Group 3: 7.5, 7.2, 7.9
• Then we compute the ANOVA F statistic in the following manner.

ANOVA Cont…

                            WITHIN difference:      BETWEEN difference:
                            data - group mean       group mean - overall mean
data    group   group mean  plain     squared       plain     squared
5.3     1       6.00        -0.70     0.490         -0.44     0.194
6.0     1       6.00         0.00     0.000         -0.44     0.194
6.7     1       6.00         0.70     0.490         -0.44     0.194
5.5     2       5.95        -0.45     0.203         -0.49     0.240
6.2     2       5.95         0.25     0.063         -0.49     0.240
6.4     2       5.95         0.45     0.203         -0.49     0.240
5.7     2       5.95        -0.25     0.063         -0.49     0.240
7.5     3       7.53        -0.03     0.001          1.09     1.195
7.2     3       7.53        -0.33     0.111          1.09     1.195
7.9     3       7.53         0.37     0.134          1.09     1.195

TOTAL                                 1.757                   5.127
TOTAL/df                              0.250952                2.563667

overall mean: 6.44
F = 2.563667/0.250952 = 10.21575

ANOVA Cont…

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        5.127333   2    2.563667   10.21575   0.008394   4.737416
Within Groups         1.756667   7    0.250952
Total                 6.884      9

• Between Groups df: 1 less than the number of groups.
• Within Groups df: number of data values - number of groups (equals the df for each group added together).
• Total df: 1 less than the number of individuals (just like other situations).
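The F statistic and P-value in the ANOVA table can be reproduced from the SS and df columns. The sketch below takes those values from the table; the p-value helper is an assumption-laden shortcut, using a closed form of the F distribution's upper tail that is valid only when the numerator df equals 2, as it does here.

```python
# SS and df values taken from the ANOVA table above.
ss_between, df_between = 5.127333, 2
ss_within, df_within = 1.756667, 7

ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F test statistic: ratio of the two mean squares.
f_stat = ms_between / ms_within

def f_sf_numerator_df2(f, d2):
    """Upper-tail probability of F(2, d2); this closed form holds only for numerator df = 2."""
    return (1 + 2 * f / d2) ** (-d2 / 2)

p_value = f_sf_numerator_df2(f_stat, df_within)

f_crit = 4.737416               # 5% critical value quoted in the table
reject_h0 = f_stat > f_crit     # equivalent to p_value < 0.05

print(round(f_stat, 5), round(p_value, 6), reject_h0)
```

Since F exceeds the critical value, H0 is rejected: not all group means are equal, and a post hoc test would be needed to say which ones differ.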

ANOVA Using SPSS

• Analyze > Compare means > One way ANOVA > Put the continuous variable under "Dependent list" > Put the categorical variable under "Factor" > Select "Post hoc" tests > Ok.

Thank You
