miki-sileshi
8/3/2019 Biostat Lecture Note
http://slidepdf.com/reader/full/biostat-lecture-note 1/116
3/1/2010
Biostatistics
Samson G/Medhin, MPH
Course Objective
General Objectives:
• To acquaint students with the basic and intermediate statistical concepts and tools for collecting, analyzing, presenting and drawing conclusions from data.
Course Objective
Specific objectives:
At the end of the course students will be able to:
• Describe the scope and application of statistics;
• Become acquainted with the types of variables and scales of measurement;
• Describe data with appropriate diagrammatic and numeric summary techniques;
• Understand the basic rules of probability and their statistical application in health sciences;
• Comprehend different sources of health and demographic data and appreciate their respective advantages and disadvantages;
Course Objective Cont…
• Understand the basic sampling techniques;
• Calculate optimal sample size for different types of studies;
• Calculate and interpret confidence intervals;
• Carry out hypothesis testing about different statistical parameters;
• Understand and apply intermediate statistical methods including correlation, linear regression, logistic regression and ANOVA;
• Carry out exploratory data analysis using SPSS;
• Understand and interpret statements in published articles pertaining to statistics.
Time Schedule
• Time Schedule.doc
Mode of Evaluation
• Mid 35%
• Final 40%
• Assignments/Quiz 10%
• Term paper 15%
References
1. M. Pagano and K. Gauvreau. Principles of Biostatistics, 2nd ed., Duxbury Thompson Learning, 2000.
2. T. Colton. Statistics in Medicine, Lippincott Williams & Wilkins Publisher, 1974.
3. B. Rosner. Fundamentals of Biostatistics, 6th ed., Thomson Books, 2006.
4. M. Bland. An Introduction to Medical Statistics, 5th ed., Oxford Medical Publications, 1993.
5. W. Daniel. Biostatistics: A Foundation for Analysis in Health Sciences, 8th ed., John Wiley and Sons Inc, 2005.
6. S. Landau and B. S. Everitt. Handbook of Statistical Analyses using SPSS, Chapman & Hall/CRC, 2004.
Introduction
What is Statistics?
• Statistics is a field of study concerned with the collection, organization and summarization of data, and the drawing of inferences about a body of data when only part of the data is observed.
• It is concerned with:
– Designing experiments and data collection,
– Summarizing information to aid understanding,
– Drawing conclusions from data,
– Estimating the present and predicting the future based on statistical evidence.
What is Statistics?
• Mathematical statistics: Concerned with the development of new methods of statistical inference; it requires detailed knowledge of abstract mathematics.
• Applied statistics: Involves applying the methods of mathematical statistics to specific subject areas.
• Biostatistics is the application of statistical methods to biological phenomena.
What is Statistics cont…
• In clinical medicine and public health (PH), statistics can be applied to:
– Determine the accuracy of measurement,
– Compare measurement techniques,
– Assess diagnostic tests,
– Determine normal values,
– Estimate prognosis,
– Compare the efficacy of treatment techniques,
– Determine the prevalence of an event,
– Identify determinants of health problems,
– Compute adequate sample size for studies,
– Etc.
Statistical Data
• Refers to the numerical description of things in the form of counts or measurements.
• Though statistical data always involve numeric description, not all numeric descriptions are statistical data.
• Statistical data should have the following characteristics:
– They must be in aggregate,
– They must be affected to a marked extent by multiple causes,
– They must be collected in a systematic manner,
– They must be estimated with reasonable accuracy,
– They must be placed in relation to each other.
Classification of Statistics
• Descriptive Statistics: The methodology of effectively collecting, organizing and describing data.
• Inferential Statistics: Includes:
– Inductive Statistics: The process of drawing conclusions about unknown characteristics of a population, based on a study of a sample.
– Predictive Statistics: The process of predicting the future based on historical data.
Classification Cont…
• Based on the underlying assumptions made during analysis, statistical methods can be classified as:
• Parametric statistics: A branch of statistics that assumes the data come from a particular type of probability distribution and makes inferences about the data based on that distribution.
• Non-parametric statistics: Interpretation does not depend on the population fitting any particular distribution.
Rationale of Studying Statistics
• It enables us to organize information in a formal manner.
• Issues in science are becoming more and more quantitative.
• Statistics is extensively used in the medical literature.
• The planning, conducting and implementing of medical and public health research are highly reliant on statistical methods.
• There is a great deal of intrinsic variation in most biological processes.
Possible Limitations of Statistics
• It mainly deals with variables which can be quantified.
• It deals with aggregates of facts; it may not give information about individuals.
• It is highly reliant on cutoff points.
• Analysis is done based on multiple assumptions.
• Errors are possible in statistical decisions.
Types of Variables
• A variable is any characteristic of a study unit (for example, an individual) that is measurable and/or classifiable, and that can take different values for different units.
• Depending on their quantifiability, variables can be classified as Qualitative or Quantitative.
• Qualitative (Categorical) Variable: a characteristic which cannot be measured in quantitative form but can be identified by names or categories. Examples: religion, ethnicity, illness status (well or ill), treatment outcome (improved or not improved), stage of breast cancer (I, II, III, IV), etc.
Types of Variables Cont…
• Quantitative Variable: a characteristic that can be measured and expressed numerically.
• This can be of two types:
• Discrete Quantitative Variable:
– Can take only a finite or countable number of values (usually whole numbers).
– Example: number of children, number of episodes of illness.
• Continuous Quantitative Variable:
– Measured on a continuous scale.
– It can assume an infinite number of values between two given values.
– Example: height, weight, age, blood sugar level.
Scale of Measurement
• In clinical medicine and public health, as in many other areas of science, we typically assign numbers to various attributes of people, objects, or concepts.
• This process is known as measurement.
• The process of measurement involves assigning numbers to observations according to rules.
• The way that the numbers are assigned determines the scale of measurement.
• Four scales of measurement are typically discussed.
Scale of Measurement Cont…
Nominal Scale:
• Is the lowest scale of measurement.
• Numbers are assigned to categories as "names", arbitrarily.
• Therefore, the only number property of the nominal scale of measurement is “identity”.
• For example, classifying people according to gender is a common application of a nominal scale. We may assign number "1" to "male" and number "2" to "female", or the opposite. The only mathematical operation we can perform with nominal data is to count.
Scale of Measurement Cont…
Ordinal Scale:
• The ordinal scale has the property of magnitude.
• It assigns each measurement to one of a limited number of categories that are ranked in terms of graded order.
• However, the interval between the categories is not necessarily equal.
• Example: cancer stage, rank in a race.
Scale of Measurement Cont…
Interval Scale:
• The interval scale has the property of equal intervals between values.
• It doesn’t have a true zero point; the number "0" is arbitrary.
• Consequently, the ratio between two values on an interval scale doesn’t have a meaningful interpretation.
• E.g.: in measuring temperature on the °C scale, we can always be confident that the distance between 25 °C and 35 °C is the same as the distance between 65 °C and 75 °C.
• However, 0 °C doesn’t mean there is no temperature. Similarly, it would be inappropriate to say that 60 °C is twice as hot as 30 °C.
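
The temperature example can be checked with a little arithmetic. The sketch below is illustrative only (the helper `celsius_to_kelvin` is a name chosen here, not part of the lecture): it converts to the Kelvin scale, which has a true zero, and shows that differences survive the change of scale while the naive ratio of 2 does not.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert degrees Celsius to kelvin (a scale with a true zero)."""
    return c + 273.15

# Differences are meaningful on an interval scale: 10 degrees either way.
diff_celsius = 35 - 25
diff_kelvin = celsius_to_kelvin(35) - celsius_to_kelvin(25)

# Ratios are not: 60/30 looks like "twice as hot" only because 0 °C is arbitrary.
naive_ratio = 60 / 30
absolute_ratio = celsius_to_kelvin(60) / celsius_to_kelvin(30)

print(naive_ratio)               # 2.0
print(round(absolute_ratio, 3))  # 1.099
```

On the absolute scale 60 °C is only about 10% "hotter" than 30 °C, which is why ratios are reserved for ratio-scale variables.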
Scale of Measurement Cont…
Ratio Scale:
• The ratio scale of measurement has the properties of equal intervals between values and an absolute/true zero.
• These properties allow us to apply all mathematical operations (addition, subtraction, multiplication, and division) in data analysis.
• The absolute/true zero allows us to know how many times greater one case is than another.
Data Collection Method
• In order to draw valid conclusions from data, information has to be collected in a systematic manner.
• A haphazardly collected dataset is less likely to produce valuable and generalizable information.
• Data may be derived from several sources.
• Depending on the source, data can be classified as Primary or Secondary.
• Primary data are gathered for the first time by the researcher for a given purpose, while
• Secondary data are data already collected by others, for purposes other than the research question at hand.
Data Collection Method Cont…
Survey through interview:
• A quantitative approach in which a standardized questionnaire, administered through interview, is used to collect information.
• Advantage:
– Quick and inexpensive,
– Responses from different respondents are comparable,
– Easy to quantify and analyze,
– Useful in describing quantifiable characteristics of a large population,
Data Collection Method Cont…
– Very large and representative samples are feasible,
– Standardized questions make measurement more precise,
– Participants do not need to be able to read and write to respond.
• Disadvantage:
– Doesn’t give qualitative information,
– Doesn’t give opportunity to probe and explore,
– Relatively inflexible,
– Less reliable for assessing the behavior and attitudes of respondents.
Data Collection Method Cont…
Survey through self-administered questionnaire:
• A quantitative method in which a standardized questionnaire, to be filled in by the respondents themselves, is used.
• Advantage:
– Quick and inexpensive,
– Responses from different respondents are comparable,
– Useful in describing quantifiable characteristics of a large population,
– Very large and representative samples are feasible,
– Standardized questions make measurement more precise.
Data Collection Method Cont…
• Disadvantage:
– Participants need to be able to read and write to respond,
– High non-response rate,
– Doesn’t give qualitative information,
– Doesn’t give opportunity to probe and explore,
– Less reliable for assessing the behavior and attitudes of respondents,
– Relatively inflexible.
Data Collection Method Cont…
Secondary data:
• A quantitative approach which utilizes data already collected by others.
• Advantage:
– Less resource and time consuming.
• Disadvantage:
– May not give in-depth information,
– No knowledge of the accuracy of data collection,
– Can be outdated,
– Limited control over the sampling method and size,
– Less likely to give qualitative information.
Data Collection Method Cont…
Focus Group Discussion (FGD):
• A qualitative method to obtain in-depth information on concepts and perceptions about a certain topic through spontaneous group discussion of approximately 6–12 persons, guided by a facilitator.
• Advantage:
– Excellent approach to gather in-depth information on the attitudes and beliefs of a group,
– Group dynamics might generate more ideas than individual interviews,
– Provides an excellent opportunity to probe and explore,
– Participants are not required to read or write,
Data Collection Method Cont…
– Unearths sensitive issues which are not commonly raised by individuals,
– Facilitates the exploration of collective memories.
• Disadvantage:
– Requires a strong facilitator to guide the discussion and ensure participation by all members,
– Doesn’t give quantitative information,
– It is difficult to organize the discussion,
– Analysis is relatively difficult.
Data Collection Method Cont…
In-depth interview:
• A qualitative method that relies on person-to-person discussion.
• Advantage:
– Good approach to gather in-depth attitudes and beliefs from individual respondents,
– Provides an excellent opportunity to probe and explore,
– Participants don’t need to be able to read and write to respond,
– Assures privacy.
Data Collection Method Cont…
• Disadvantage:
– Doesn’t give quantitative information,
– It is time consuming,
– The respondent may feel like ‘a bug under a microscope’,
– The analysis is relatively difficult.
Data Collection Method Cont…
Observation:
• A qualitative method that involves critical observation and recording of the practice (behavior, culture…) of individuals or a group.
• Excellent approach to discover behaviors,
• Usually takes a longer time,
• Liable to “observational bias”.
Designing Questionnaire
• Most data collection techniques utilize questionnaires.
• Hence, the quality of the data depends on how well the questionnaire is designed.
• There are two main objectives in designing a questionnaire:
– To obtain accurate, relevant information for the study,
– To maximize the response rate.
Designing Questionnaire Cont…
• A questionnaire can be classified on several bases:
• Structured Vs Non-structured Questionnaire:
– The structured one is mainly designed for surveys.
– A series of questions are arranged in a logical order and sequence and divided into subtopics.
– A skip pattern is important for a structured questionnaire.
– The data collector is expected to go smoothly through the sequence.
– The non-structured one is commonly used for qualitative studies.
– It doesn’t have a strict sequence of questions.
– The data collector may rearrange the questions depending on the response of the subject.
Designing Questionnaire Cont…
• Open-ended Vs Close-ended Questionnaire (Question):
• Open-ended questions permit a free response that should be recorded in the respondent’s own words.
• They allow exploration of the range of possible themes.
• Close-ended questions offer a list of possible options or answers from which the respondents must choose.
• They are relatively easy and quick to fill in, code, analyze and report.
Designing Questionnaire Cont…
Standardized Vs Non-standardized Questionnaire:
• A standard questionnaire is developed by a well-known body and considered to be “standard” for assessing a given research question.
• A non-standard one is developed by the researcher to address the research question.
• What are the advantages and disadvantages of using a standardized questionnaire?
Steps in Designing a Questionnaire
1. Developing Individual Questions:
– Use short and simple sentences.
– Ask for only one piece of information at a time.
– Ask precise questions to address the objective of the study.
– Give extra attention to sensitive questions.
– Avoid leading questions.
2. Format of responses: Questions should be formatted into open or closed formats depending on the need.
Steps Cont…
3. Arranging the Questions:
• Go from general to particular.
• Go from easy to difficult.
• Go from factual to abstract.
• Start with closed questions.
• Start with demographic and personal questions.
4. Piloting and Evaluation of Questionnaire:
• Given the complexity of designing a questionnaire, it is impossible even for experts to get it right the first time round.
• Questionnaires must be pretested (piloted) on a small sample of people characteristic of those in the survey.
Diagrammatic Summarization
Introduction
• Data collection yields a set of data called Raw Data.
• The size of the data can range from a few hundred to many thousands of observations.
• Raw data, however, will not necessarily provide information that can easily be interpreted.
• Data presentation is a mechanism which enables easier understanding of a given set of data through the use of tables and graphs.
• In data summarization the level of detail is compromised, but this is compensated for by a gain in understanding of the data.
Tables
• The simplest means of data presentation, which can be used for all types of data.
Frequency Distribution
• One type of information that is commonly used to organize data in tables is the Frequency Distribution.
• For nominal or ordinal data, the frequency distribution consists of a set of categories along with the numeric counts that correspond to each one.
• Example:
Tables Cont…
Table 2.1: Ethnic Composition of Women of Reproductive Age in Awassa Town, Jan 2006.
Ethnic Group    Frequency
Wolita          377
Amhara          355
Sidama          163
Oromo           144
Guragae         138
Kenbata         82
Tigray          47
Hadya           20
Others          50
Total           1376
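
A frequency distribution like Table 2.1 is simply a tally of raw categorical records. As a minimal sketch, assuming the raw data arrive as one category label per subject (the short list `raw` below is made up for illustration and is not the Awassa data), the counts can be produced with Python's `collections.Counter`:

```python
from collections import Counter

# Hypothetical raw data: one recorded category per study subject.
raw = ["Wolita", "Amhara", "Sidama", "Wolita", "Oromo", "Amhara", "Wolita"]

# Counter tallies how many times each category occurs.
freq = Counter(raw)

# Print the categories from most to least frequent, then the total.
for group, count in freq.most_common():
    print(f"{group:<10}{count}")
print(f"{'Total':<10}{sum(freq.values())}")
```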
Tables Cont…
• In displaying numeric data using a frequency distribution we should note the following:
• The range of values must be broken down into a series of distinct and non-overlapping intervals.
• The intervals should cover all data points.
• Intervals are often constructed, though not necessarily, so that all have equal width. This facilitates comparison among classes.
• Open-ended intervals should be avoided.
• The limits for each class must agree with the accuracy of the raw data.
Tables Cont…
• An appropriate number of intervals should be chosen: too many intervals do not summarize much, while too few lose a great deal of information.
• As a rule of thumb, the number of classes should be between 10 and 20.
• When we don’t have any evidence on which to decide the number of classes, we can use Sturges’ Formula:
• No of classes = 1 + [3.322 x log10 (no of observations)]
• The width of each class can also be calculated as:
• Width of the class = (Max value - Min value) / No of classes
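
The two formulas above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are chosen here, the result of Sturges' formula is rounded up to a whole number of classes (the slide does not say how to round), and the sample size and range are hypothetical.

```python
import math

def sturges_classes(n_observations: int) -> int:
    """Number of classes from Sturges' formula, rounded up (an assumption)."""
    return math.ceil(1 + 3.322 * math.log10(n_observations))

def class_width(min_value: float, max_value: float, n_classes: int) -> float:
    """Width of each class when the range is split into equal intervals."""
    return (max_value - min_value) / n_classes

# Hypothetical dataset: 60 observations ranging from 12 to 68.
n = 60
k = sturges_classes(n)          # 1 + 3.322 * log10(60) is about 6.9
width = class_width(12, 68, k)

print(k)      # 7
print(width)  # 8.0
```

In practice the computed width is usually rounded to a convenient value that agrees with the accuracy of the raw data.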
Tables Cont…
Relative and Cumulative Frequency
• In addition to counts, it is useful to know the proportion of values that fall into a given class.
• Relative frequency of a class is the proportion or percentage of the total number of observations that fall in a given class.
• Cumulative relative frequency of a class is the proportion (percentage) of the total number of observations that have a value less than or equal to the upper limit of a given interval.
• If such information is given in the form of counts it is simply called Cumulative frequency.
Tables Cont…
Table 2.2: Cumulative and Relative Frequency of Age Structure of Women of Reproductive Age in Awassa Town, Jan 2006.
Age Group   Number of women   Relative Frequency (%)   Cumulative Relative Frequency (%)
15-19       399               28.9                     28.9
20-24       341               24.7                     53.6
25-29       281               20.4                     74.0
30-34       143               10.4                     84.3
35-39       116               8.4                      92.8
40-44       54                3.9                      96.7
45-49       42                3.0                      100.0
Total       1380              100.0
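
The relative and cumulative columns of Table 2.2 follow mechanically from the counts. The sketch below recomputes them with `itertools.accumulate`, using the class counts as listed. Note that these counts sum to 1376 whereas the table's total row reads 1380, so the percentages computed here may differ from the printed ones in the last decimal place.

```python
from itertools import accumulate

age_groups = ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49"]
counts = [399, 341, 281, 143, 116, 54, 42]

total = sum(counts)

# Relative frequency: share of all observations falling in each class (%).
relative = [100 * c / total for c in counts]

# Cumulative versions: running totals of the counts and of the percentages.
cumulative_counts = list(accumulate(counts))
cumulative_relative = list(accumulate(relative))

for group, c, rel, cum in zip(age_groups, counts, relative, cumulative_relative):
    print(f"{group:<8}{c:>6}{rel:>8.1f}{cum:>8.1f}")
```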
Tables Cont…
• Depending on the number of variables represented, tables can be classified as one-way, two-way and higher order tables.
• One-way Table: Only one variable is summarized in the table.
• Two-way Table (Cross tabulation): Two variables are organized simultaneously, in a combined manner, in one table.
• Higher Order Table: Three or more variables are presented simultaneously in a table. The higher the order of the table, the more complicated the interpretation.
Tables Cont…
Educational status of women   Child Ever Born
                              >=5     < 5
Illiterates                   42      68
Read and Write                9       19
1st-4th grade                 32      60
5th-8th grade                 46      211
9th-12th grade                42      239
> 12th grade                  7       68
Total                         175     665
What type of table is this?
Tables Cont…
                            History of illness in the preceding 2 weeks
Child’s Age   Child’s Sex   Yes   No    Total
0-11 mo       Male          15    86    101
              Female        18    84    102
12-23 mo      Male          13    80    93
              Female        12    78    90
24-35 mo      Male          10    76    86
              Female        11    77    88
36-47 mo      Male          9     74    83
              Female        9     73    82
48-59 mo      Male          6     69    75
              Female        7     70    77
Tables Cont…
• In constructing tables, the following standards should be followed:
– Tables should be simple and self-explanatory,
– Every table should have a title (usually at the top of the table) which indicates the who, what, when and where of the data presented,
– Rows and columns should be labeled,
– Totals should be indicated,
– Numeric entries of zero should be written as “0”, while missing or unobserved data should be represented by “-”,
– If the data are not original, their source should be given as a footnote,
– Complicated tables should be avoided.
Diagrammatic Representation
• A second way to present data is through the use of graphs or pictures (Diagrammatic Representations).
• Though diagrammatic representations are easier to read than tables, they supply a lesser degree of detail.
• However, the lesser detail can be compensated for by a gain in understanding of the data.
• Diagrammatic representation has the following advantages:
– Diagrams are easier to understand and remember,
– They are more attractive,
– They facilitate comparison among groups,
– They may show patterns within the data set.
Bar Charts (Bar Graphs)
• Bar graphs are a popular type of graph used to display a frequency distribution for nominal or ordinal data.
• In the most common form, the Vertical Bar Graph (Column Graph), the categories into which the observations fall are presented along the horizontal axis.
• A vertical bar is drawn above each category so that the height of the bar represents either the frequency or the relative frequency of observations within that class.
• The bars should have equal width and be separated from one another so as not to imply continuity.
• In a Horizontal Bar Graph the axes are reversed: the categories run along the vertical axis and the length of each bar represents the frequency.
Bar Charts Cont…
Bar graphs have different types:
• Simple Bar Graph:
– Depicts the frequency or relative frequency of the classes of a variable.
– The intention is to compare the frequency of different classes of a variable.
[Figure: simple bar graph. X axis: the time breastfeeding was initiated (within an hr, 1-24 hr, after the first day). Y axis: percentage of children aged 0-11 months, scaled 0-70.]
Bar Charts Cont…
• Multiple Bar Graph:
– Depicts the frequency or relative frequency of the classes of a variable in two or more situations.
– This type enables comparison between the levels of the classes of the variable in different situations.
[Figure: multiple bar graph. X axis: the time breastfeeding was initiated (within an hr, within a day, after the first day). Y axis: %, scaled 0-70. Legend: Baseline, End line. Data labels shown on the bars: 28, 60, 26, 63.3, 33.5, 2.8.]
Bar Charts Cont…
• Component Bar Graph:
– Similar to the simple bar graph except that the bars are divided into components.
– The graph shows the relative contribution of the components to the bar (category).
[Figure: component bar graph. X axis: the time breastfeeding was initiated (within an hr, 1-24 hr, after the first day). Y axis: percentage of children aged 0-11 months, scaled 0-70. Legend: Female, Male.]
Bar Charts Cont…
• 100% Component Bar Graph:
– Similar to the component bar graph.
– But the height of all the bars is set at 100% so that the relative contribution of the components can easily be compared.
[Figure: 100% component bar graph. X axis: within an hr, within a day, after the first day. Y axis: 0%-100%. Legend: Females, Males.]
Pie Chart
• A Pie Chart is a circular chart divided into sectors, illustrating the relative magnitudes or frequencies of the classes of a given variable.
• A pie chart usually represents categorical data, but it is also possible to use it for discrete quantitative data.
• The angle of each sector has to be proportional to the relative frequency of the given class.
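
Because a full circle is 360 degrees, the rule that sector angle is proportional to relative frequency reduces to angle = 360 x (class frequency / total). As an illustrative sketch, the sector angles for the ethnic groups of Table 2.1 could be computed as follows:

```python
# Frequencies from Table 2.1 (women of reproductive age, Awassa Town).
counts = {
    "Wolita": 377, "Amhara": 355, "Sidama": 163, "Oromo": 144,
    "Guragae": 138, "Kenbata": 82, "Tigray": 47, "Hadya": 20, "Others": 50,
}

total = sum(counts.values())

# Each sector's angle is its relative frequency times 360 degrees.
angles = {group: 360 * c / total for group, c in counts.items()}

for group, angle in angles.items():
    print(f"{group:<10}{angle:6.1f} degrees")
```

The angles necessarily add up to 360 degrees, which is a quick check on the arithmetic.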
Histogram
• Whereas a bar chart represents a frequency distribution for nominal or ordinal data, a Histogram depicts a frequency distribution for continuous data.
• The horizontal axis displays the true limits of the intervals; the vertical axis represents the frequency or relative frequency of each interval.
• If the intervals have equal width, the frequency associated with each interval can be represented by the height of the respective bar.
• However, if the bars have different widths, the histogram should be drawn in such a way that the Y axis represents the frequency density and the X axis the interval.
Histogram Cont…
• Then the respective frequency of the interval is represented by the area of the bar.
• Frequency density of an interval = frequency of the interval / true class width.
• Unlike a bar graph, in a Histogram the categories (bars) must be adjacent. Hence, to construct a Histogram, true class boundaries should be used rather than class intervals.
• For example the following table summarizes the Biostatistics mid exam scores of 38 students, marked out of 35.
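
The frequency-density rule can be illustrated numerically. The sketch below uses hypothetical true class boundaries and frequencies (they are invented for illustration and are not the exam-score table from the slide) to show that, with unequal widths, the bar height is the density and the bar area recovers the frequency.

```python
# Hypothetical classes given as (lower true boundary, upper true boundary).
boundaries = [(0.5, 10.5), (10.5, 20.5), (20.5, 25.5), (25.5, 35.5)]
frequencies = [4, 12, 10, 12]  # hypothetical counts per class

# True class width of each interval.
widths = [upper - lower for lower, upper in boundaries]

# Frequency density = frequency / true class width (the bar height).
densities = [f / w for f, w in zip(frequencies, widths)]

# Area of each bar = height x width, which gives back the frequency.
areas = [d * w for d, w in zip(densities, widths)]

for (lower, upper), d in zip(boundaries, densities):
    print(f"{lower:>5}-{upper:<5} density = {d:.1f}")
```

Here the third class holds fewer observations than the second but is half as wide, so its bar is drawn taller.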
Frequency Polygon
• A Frequency Polygon depicts a frequency distribution for continuous numeric data.
• Frequency polygons are a graphical device for understanding the shapes of distributions.
• A Histogram can easily be changed into a Frequency Polygon by joining the midpoints of the tops of adjacent rectangles of the Histogram with a line.
• It is also possible to draw a Frequency Polygon without drawing a Histogram. The procedure is as follows:
Frequency Polygon Cont…
1. Identify the midpoints of all the class intervals of the given data,
2. Plot the midpoints (on the X axis) against the respective frequency or relative frequency of each class (on the Y axis),
3. Connect adjacent points with a straight line.
Frequency Polygon Cont…
• For example the following Frequency Distribution represents the ages (in years) of 60 patients at a psychiatric counseling centre.
Frequency Polygon Cont…
• First we have to identify the midpoints of each interval.
Frequency Polygon Cont…
• Finally we have to plot the midpoints (on the X axis) against the respective frequency of each class (on the Y axis) and connect adjacent points with a straight line.
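
The midpoint step can be sketched as follows. The class limits and frequencies here are hypothetical stand-ins chosen to total 60 observations (the original table is not reproduced in this text), and the midpoint is taken as the average of the lower and upper class limits.

```python
# Hypothetical age classes (lower limit, upper limit) and their frequencies.
intervals = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]
frequencies = [4, 22, 16, 10, 6, 2]

# Step 1: midpoint of each class = (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in intervals]

# Step 2: pair each midpoint (X) with its class frequency (Y).
points = list(zip(midpoints, frequencies))

# Step 3 would join consecutive points with straight line segments.
for x, y in points:
    print(f"midpoint {x:>5}: frequency {y}")
```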
Scatter Plot (Scatter Graph)
• A scatter plot is used to show the relation between two different continuous measurements.
• The scale for one quantity is marked on the X axis and the scale for the other on the Y axis.
• Each point on the graph represents a pair of values for the two measurements.
• For each value on the X axis, it is possible to have multiple Y values.
• The following scatter plot shows the relation between age and blood glucose level among diabetic patients aged 50-70 years.
Scatter Plot Cont…
[Figure: scatter plot. X axis: Age in Years (50-70). Y axis: Blood Glucose level, mg/dl (120-200).]
Line Graph
• A line graph is similar to a scatter plot in that it shows the relation between two different continuous measurements.
• Once again each point on the graph represents a pair of values.
• However, unlike a scatter plot, each value on the X axis has a single corresponding measurement on the Y axis.
• As the name indicates, points on the graph are connected to the adjacent points with straight lines.
• Most commonly the scale along the X axis represents time. Consequently we are able to trace chronological changes.
Line Graph Cont…
Figure 2.8: Mean Number of Child Ever Born to Women at the Age of 25 years, Awassa Town (1980-2005)
[Figure: line graph. X axis: Year (GC), 1980-2005. Y axis: Mean Child Ever Born among Women at the Age of 25, scaled 1-3.75.]
Cumulative Line Graph
• Also known as the Ogive Graph.
• It is best used when you want to display the total at any given time.
• The relative slopes from point to point indicate greater or lesser increases.
• For example, a steeper slope means a greater increase than a more gradual slope.
• For example, if you saved $300 in both January and April and $100 in each of February, March, May, and June, the Ogive would look as follows.
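
The points of the Ogive in the savings example are just running totals. A minimal sketch using `itertools.accumulate` on the monthly amounts given above:

```python
from itertools import accumulate

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
savings = [300, 100, 100, 300, 100, 100]  # amount saved in each month ($)

# Running total at the end of each month: the Y values of the Ogive.
cumulative = list(accumulate(savings))

for month, total in zip(months, cumulative):
    print(f"{month}: {total}")
# cumulative is [300, 400, 500, 800, 900, 1000]
```

The jumps of 300 into January and April produce the steeper segments of the curve, while the 100-dollar months produce the gentler ones.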
Box and Whisker Plot
• In descriptive statistics a box-and-whisker plot is a convenient way of pictorially depicting groups of numerical data through their five-number summaries:
• The smallest observation, 1st quartile, median, 3rd quartile, and largest observation.
Box and Whisker Plot Cont…
• However, in some cases the ends of the whiskers can represent several possible alternative values.
• For example, in SPSS:
– The ends of the whiskers represent the lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile.
– Values more than 3 IQRs from the end of the box are labeled as extreme, denoted with an asterisk (*). Values more than 1.5 IQRs but less than 3 IQRs from the end of the box are labeled as outliers (o).
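
The 1.5 IQR and 3 IQR rules described above can be sketched in Python. This is an approximation under stated assumptions: the data are hypothetical, and the quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which may not match the quartile (hinge) definition SPSS uses, so boundary cases can differ.

```python
import statistics

# Hypothetical sample containing one mild outlier and one extreme value.
data = sorted([2, 4, 5, 6, 7, 8, 9, 10, 11, 20, 40])

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences at 1.5 IQR (outliers) and 3 IQR (extremes) from the box ends.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

# Whiskers end at the most distant data values still inside the inner fences.
inside = [x for x in data if inner_low <= x <= inner_high]
whiskers = (min(inside), max(inside))

extremes = [x for x in data if x < outer_low or x > outer_high]
outliers = [x for x in data
            if (x < inner_low or x > inner_high) and x not in extremes]

print("five-number summary:", data[0], q1, median, q3, data[-1])
print("whiskers:", whiskers, "outliers (o):", outliers, "extremes (*):", extremes)
```

For this sample the box runs from 5.5 to 10.5, the whiskers reach 2 and 11, the value 20 is flagged as an outlier and 40 as an extreme.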
Stem and Leaf Plot
• Is a display that organizes data to show its shape and distribution.
• Each data value is split into a "stem" and a "leaf" portion.
• The "leaf" is the last digit of the number, and the other digits to the left of the "leaf" form the "stem".
• For example, the number 42 would be split apart, with the stem becoming the 4 and the leaf becoming the 2.
• Consider the following dataset, sorted in ascending order: 8, 13, 16, 25, 26, 29, 30, 32, 37, 38, 40, 41, 44, 47, 49, 51, 54, 55, 58, 61, 63, 67, 75, 78, 82, 86, 95.
Stem and Leaf Plot Cont…
0|8
1|3 6
2|5 6 9
3|0 2 7 8
4|0 1 4 7 9
5|1 4 5 8
6|1 3 7
7|5 8
8|2 6
9|5
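
The display above can be reproduced programmatically. The following sketch assumes non-negative whole numbers and uses `divmod(value, 10)` to separate the last digit (leaf) from the remaining digits (stem):

```python
from collections import defaultdict

data = [8, 13, 16, 25, 26, 29, 30, 32, 37, 38, 40, 41, 44, 47, 49,
        51, 54, 55, 58, 61, 63, 67, 75, 78, 82, 86, 95]

# Group the leaves (last digits) under their stems (leading digits).
stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)
    stems[stem].append(leaf)

# One line per stem, including any stem in the range that has no leaves.
lines = [
    f"{stem}|{' '.join(str(leaf) for leaf in stems[stem])}"
    for stem in range(min(stems), max(stems) + 1)
]
print("\n".join(lines))
```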
Pictogram
• A Pictogram is a graph which uses pictures or symbols to present data.
• It usually presents the frequency of one or more categorical or discrete numeric variables in the form of symbols.
• The magnitude of the variable can be shown either by the size of the picture or by the number of pictures.
• For example the following pictogram represents the number of passengers per year across four airports of the UK.
Issues to be considered in diagrammatic representation
• Depending on the type of the data, the right type of diagrammatic representation should be selected.
• It is not common to use two or more types of diagrammatic representation simultaneously for the same data. The best should be selected and used.
• Each graph and diagram should be labeled (usually the title is given below the figure).
• The title should indicate the “Who”, “What”, “When” and “Where” of the data presented.
• If the representation is taken from another source, the primary source should be indicated.
Issues to be considered Cont…
• In graphs, the X and Y axes should be labeled clearly with their units of measurement.
• In graphs, the scales of the X and Y axes should be drawn proportionally.
• Pictorial representations usually require a “Key” to facilitate easier interpretation.
• When colors are employed, contrasting colors should be selected.
Diagrammatic Representation Using SPSS
• To produce graphs in SPSS, follow this menu path:
• Graphs > Legacy Dialogs > select the appropriate graph
• Available types include Bar graph, Pie chart, Histogram, Line graph, Scatter plot and Box plot.
• Other, rarely used types are also available.
• Most of the graphs can also be produced from the “Analyze > Descriptive Statistics” menu.
Numeric Summarization
Introduction
• Even though diagrammatic representation greatly enhances understanding of the data, it does not give mathematically amenable outputs.
• This gap is addressed by numeric summarization.
• In summarizing a dataset using numeric indicators, we often focus on describing the data with two summary figures. These are:
– Central Tendency (Location)
– Variation (Spread)
Measures of Central Tendency
• One of the most commonly used measures to summarize a set of data is its center.
• The center is a value (usually a single value) chosen in such a way that it gives a reasonable approximation of the whole dataset.
• In statistics, the number which tends to approximate the center of a set of data is called a Measure of Central Tendency or Average.
• The Arithmetic Mean, Median and Mode are the most commonly used measures of central tendency.
Measures of Central Tendency Cont…
Attributes of a good measure of central tendency are:
• It should be based on all observations.
• It should not be affected by extreme values.
• It should have a definite value.
• It should not be subjected to complicated computation.
• It should be capable of further algebraic treatment.
• It should be close to the location where the majority of the observations are located.
Arithmetic Mean
• The Arithmetic Mean is usually called the Mean.
• It is the most familiar measure of central tendency.
• It is calculated by adding all of the individual values and dividing the sum by the number of individual values.
• In statistics, two separate symbols are used for the mean.
• The Greek letter µ (mu) is used to denote the population mean.
• The symbol x̄ (read as "x bar") is used to denote the sample mean.
Arithmetic Mean Cont…
• When n is the total number of observations and x_i is the value of X for the i-th observation, the formula of the arithmetic mean is given as:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

• In calculating the mean from grouped data, we assume all values falling into a particular class interval are located at the midpoint of the interval.
• The formula is given as:

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i m_i}{n}$$
Arithmetic Mean Cont…
Where:
– k is the number of class intervals,
– m_i is the midpoint of the i-th class interval,
– f_i is the frequency of the i-th class interval,
– n is the total number of observations.
• The formula simply means each value within the interval is represented by the midpoint of the true class interval. Then we can calculate the mean as usual.
Arithmetic Mean Cont…
Example 3.1: Consider the time taken by 30 students to do a Biostatistics quiz.

Minutes spent on Quiz   Number of students (f)   True class interval   Midpoint (m)   f·m
1-5                     2                        0.5-5.5               3              6
6-10                    12                       5.5-10.5              8              96
11-20                   16                       10.5-20.5             15.5           248
Total                   30                                                            350

Thus the mean of the data is 350/30 = 11.7 minutes.
Arithmetic Mean Cont…
• The major advantages of the mean are:
– It is calculated based on all observations.
– Its mathematical computation is not complicated.
– It accommodates further mathematical applications.
– It can only have one value.
• The major disadvantages of the mean are:
– It is affected by extreme values.
– It is not a suitable summary when the data are markedly skewed (not approximately normally distributed).
Median
• The Median is the value which divides the data into two equal halves, with half of the values being lower than the median and half higher than the median.
• When n is the number of observations in a dataset, the median is calculated as follows:
– Sort the values into ascending order.
– If you have an odd number of observations, the median is the middle observation, i.e. the observation at position (n+1)/2.
– If you have an even number of observations, the median is the arithmetic mean of the two middle observations, i.e. pick the numbers at positions n/2 and (n/2) + 1 and find the mean of those two observations.
Median Cont…
Example 3.2: Compute the median for {1, 2, 3, 4, 5}
• The numbers are already sorted, so it is easy to see that the median is 3 (two numbers are less than 3 and two are bigger).
Example 3.3: Compute the median for {1, 2, 3, 4, 5, 6}
• The median would be 3.5, since that is the middle between 3 and 4, computed as (3 + 4)/2.
• Note that three numbers are less than 3.5 and three are bigger, as the definition of the median requires.
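The position rule for the median (middle observation for odd n, mean of the two middle observations for even n) can be sketched as follows. The function name `median` is chosen for this illustration only; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Median by the position rule used in these notes."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th observation (positions counted from 1).
        return ordered[(n + 1) // 2 - 1]
    # Even n: mean of the n/2-th and (n/2 + 1)-th observations.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median([1, 2, 3, 4, 5]))     # data of Example 3.2
print(median([1, 2, 3, 4, 5, 6]))  # data of Example 3.3
```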
Median Cont…
• When we are dealing with grouped data, the median can be calculated as:

$$\tilde{X} = L_m + \left(\frac{\frac{n}{2} - F_c}{F_m}\right) w$$

• Where:
– L_m is the lower true class boundary of the interval containing the median,
– F_c is the cumulative frequency of the interval just before the median class interval,
– F_m is the frequency of the interval containing the median,
– w is the class interval width,
– n is the total number of observations.
Median Cont…
• The major advantages of the median are:
– It is not affected by extreme values,
– It can be used in skewed distributions,
– It is easy to calculate,
– It can only have one value,
– It can be calculated when there is an open-ended interval.
• The major limitations of the median are:
– It may not be a good representative if the number of observations is too small,
– It does not accommodate further mathematical applications (in parametric statistics),
– It is calculated based on one or two observations.
Mode
• The mode is by far the simplest, but the least widely used, measure of central tendency.
• It is simply the value that occurs most frequently.
• When the distribution has only one value with the highest frequency it is called Unimodal. If it has two values with equal and highest frequency it is called Bimodal. Similarly, it is possible to have a multimodal distribution.
• Example: {1, 2, 2, 3, 3, 4, 4, 4, 5}
• The mode is 4.
• In grouped data, the midpoint of the interval with the highest frequency is considered as the mode of the distribution.
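A minimal sketch of finding the mode (or modes) by counting frequencies is shown below. The helper name `modes` and the second dataset are hypothetical and used only for illustration.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 3, 4, 4, 4, 5]))  # dataset from the slide
print(modes([1, 1, 2, 3, 3]))              # hypothetical bimodal dataset
```

Returning a list rather than a single number makes the unimodal, bimodal and multimodal cases explicit.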
Mode Cont…
• For example, the following table displays the salary of 20 factory workers in factory X.

Salary in Br   Number of Factory Workers
500-600        3
600-700        6
700-800        5
800-900        5
900-1000       0
1000-1100      1

• The interval 600-700 has the highest frequency, so the midpoint of this interval, i.e. 650, is taken as the mode of the distribution.
Mode Cont…
• The major advantages of the mode are:
– It can be used when the variable is ordinal or nominal,
– It is very easy to compute,
– It is less likely to be affected by extreme values,
– It can be calculated for distributions with an open-ended class interval.
• The major disadvantages of the mode are:
– It may not perfectly denote what central tendency implies,
– It does not accommodate further mathematical application,
– It is calculated based on few observations,
– It may have more than one value for a dataset,
– At times a mode may not exist in a dataset.
Skewness and the Measures of Central Tendency
• The normal distribution is bell shaped, unimodal and symmetric.
• Skewness describes the lack of symmetry of a distribution.
• If the distribution is not symmetric (one side does not mirror the other), then it is skewed.
• Skewness is indicated by the “tail”, or trailing frequencies, of the distribution.
• If the tail is to the right, it is a positively skewed distribution. If the tail is to the left, it is a negatively skewed distribution.
• In a normal distribution, the mean, median and mode are equal.
• Skewness affects the arrangement of the three measures of central tendency in the following way.
Skewness and the Measures of Central Tendency Cont…
[Figure: positions of the mean, median and mode in symmetric, positively skewed and negatively skewed distributions.]
• Typically, mean > median > mode in a positively skewed distribution, and mean < median < mode in a negatively skewed distribution.
Weighted Mean
• The weighted mean is similar to an arithmetic mean, except that the individual data values contribute unequally to the mean.
• Each data value (X_i) has a weight assigned to it (W_i).
• Data values with larger weights contribute more to the weighted mean and data values with smaller weights contribute less to the weighted mean.
• The formula is:

$$\bar{X}_w = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}$$
Weighted Mean Cont…
• If all the weights are equal, then the weighted mean is the same as the arithmetic mean.
• The best example for the application of the weighted mean is the calculation of GPA.
• Each grade is weighted by the credit hours of its course, so a grade in a course with more credit hours contributes more to the GPA.
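The GPA case can be sketched as a weighted mean. The grade points, the credit hours and the choice of credit hours as weights are all assumptions made for this illustration, not values from the notes.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

grade_points = [4.0, 3.0, 4.0]  # hypothetical grades: A, B, A
credit_hours = [3, 2, 5]        # hypothetical weights (credit hours)

gpa = weighted_mean(grade_points, credit_hours)
unweighted = weighted_mean(grade_points, [1, 1, 1])  # equal weights
print(round(gpa, 2), round(unweighted, 2))
```

With equal weights the function reduces to the ordinary arithmetic mean, which matches the first bullet above.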
Geometric Mean
• The geometric mean is an average calculated by multiplying a set of numbers and taking the n-th root, where n is the number of numbers.
• The geometric mean is related to the log-normal distribution.
• The log-normal distribution is a distribution which is normal for the logarithm-transformed values.
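The definition above, and its link to logarithms, can be sketched as follows. Both function names and the sample values are hypothetical; the second function relies on the identity that the geometric mean equals the exponential of the arithmetic mean of the logarithms.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

def geometric_mean_via_logs(values):
    """Same quantity, computed as exp(mean of the natural logs)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

titers = [2, 8, 32]  # hypothetical positive measurements
print(geometric_mean(titers), geometric_mean_via_logs(titers))
```

The log form is the reason the geometric mean is a natural summary for data that look normal only after a log transformation.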
Harmonic Mean
• The harmonic mean (H) of n positive values is defined by the formula:

$$H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$

• It is the reciprocal of the arithmetic mean of the reciprocals.
• It applies more accurately to situations involving rates.
• For example: A blood donor fills a 250 mL blood bag at 70 mL/min on the first visit, and at 90 mL/min on the second visit. What is the average rate at which the donor fills a bag?
• Given:
– 250 mL at 70 mL/min = 3.571 min in total
– 250 mL at 90 mL/min = 2.778 min in total
Harmonic Mean Cont…
• So 500 mL in total is filled in (3.571 + 2.778) min = 6.349 min, i.e. 500/6.349 ≈ 78.75 mL/min.
• The harmonic mean, 2/[1/70 + 1/90] = 78.75 mL/min, reproduces this true average rate, whereas the arithmetic mean (80 mL/min) overstates it.
• Source: http://wiki.answers.com/Q/What_is_the_application_of_harmonic_mean_in_medicine
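The blood bag example can be checked numerically. The sketch below uses `statistics.harmonic_mean` and `statistics.mean` from the Python standard library; the variable names are chosen for this illustration.

```python
from statistics import harmonic_mean, mean

rates = [70, 90]  # mL/min on the two visits
volume = 250      # mL per bag

# True average rate: total volume divided by total filling time.
total_time = sum(volume / r for r in rates)
true_average_rate = volume * len(rates) / total_time

print(round(true_average_rate, 2))
print(round(harmonic_mean(rates), 2))
print(mean(rates))
```

Because each bag has the same volume, the total-volume-over-total-time rate and the harmonic mean coincide; the arithmetic mean would only be appropriate if both visits lasted the same time.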
Measures of Dispersion
• While measures of central tendency are used to estimate the "center" value of a dataset, measures of dispersion are important for describing the spread of the data, or its variation around a central value.
• Two distinct samples may have the same mean or median, but completely different levels of variability, or vice versa.
– Set 1: 30, 40, 40, 50, 60, 60, 70 (Mean = 50)
– Set 2: 48, 49, 49, 50, 50, 51, 53 (Mean = 50)
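To make the contrast between the two sets concrete, the sketch below computes the mean together with two measures of dispersion that the following slides define: the range and the sample standard deviation (`statistics.stdev`, which divides by n − 1). The variable names are illustrative.

```python
from statistics import mean, stdev

set_1 = [30, 40, 40, 50, 60, 60, 70]
set_2 = [48, 49, 49, 50, 50, 51, 53]

for name, data in (("Set 1", set_1), ("Set 2", set_2)):
    data_range = max(data) - min(data)
    print(f"{name}: mean={mean(data)}, range={data_range}, SD={stdev(data):.2f}")
```

Both sets share the same mean, yet the spread of Set 1 is several times that of Set 2.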
Range
• Defined as the difference between the largest and smallest sample values ($x_{\max} - x_{\min}$).
• Major advantage: It is simple to calculate.
• Major disadvantages:
– It depends only on extreme values and provides no information about how the remaining data are distributed.
– The range cannot be used for comparison when the units of measurement are different.
– The extreme values are the most unreliable parts of the data.
– It doesn’t accommodate further mathematical application.
Standard Deviation and Variance
• Standard deviation is the most common and useful measure of dispersion.
• It can be thought of as the typical (roughly average) distance of each observation from the mean.
• The formula for the sample standard deviation is given as:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

• The formula for the population standard deviation is given as:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$

• What might be the reason for the difference?
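Standard Deviation and Variance Cont…
• Variance is just the square of the standard deviation.
• The formulas for sample and population variance are given as follows:

$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \qquad \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$

• NB: Occasionally, the abbreviations SD for standard deviation and Var for variance are used.
• Standard deviation for grouped data is calculated as:

$$S = \sqrt{\frac{\sum_{i=1}^{K} f_i m_i^2 - n\bar{x}^2}{n-1}}$$

where m_i and f_i are the midpoint and frequency of the i-th class interval and K is the number of class intervals.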
Standard Deviation and Variance Cont…
• Advantages:
– They accommodate further mathematical applications.
– They are calculated from all of the observations.
• Disadvantages:
– They must always be understood in the context of the mean of the data.
– They are measured in the unit of measurement of the observed data. Thus it is difficult to compare the standard deviation/variance of two datasets measured in two different units.
Coefficient of Variation (CV)
• The standard formulation of the CV is the ratio of the standard deviation to the mean of a given dataset:

$$CV = \frac{S}{\bar{x}} \times 100\%$$

• The coefficient of variation is a dimensionless number.
• So when comparing datasets with different units, one should use the CV instead of the SD.
• The CV is useful in comparing the variability of several different samples, each with a different arithmetic mean, as higher variability is expected when the mean increases.
• The CV is also important for comparing the reproducibility of variables.
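A short sketch of using the CV to compare variables measured in different units follows. The function name and both datasets are hypothetical, and the sample standard deviation (n − 1 denominator) is assumed.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

weights_kg = [60, 65, 70, 75, 80]       # hypothetical body weights
heights_cm = [150, 160, 170, 180, 190]  # hypothetical heights

print(f"CV of weight: {coefficient_of_variation(weights_kg):.2f}%")
print(f"CV of height: {coefficient_of_variation(heights_cm):.2f}%")
```

The two standard deviations (in kg and cm) cannot be compared directly, but the two percentages can.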
Example on Grouped Data
Example 3.4:
• Consider the time taken by 30 students to do a Biostatistics quiz. Their time is summarized in the following table.

Minutes spent on Quiz   Number of students (f)
1-5                     2
6-10                    12
11-20                   16
Total                   30
Example Cont…
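Minutes spent on Quiz   Number of students (f)   True class interval   Midpoint (m)   f·m   f·m²
1-5                     2                        0.5-5.5               3              6     18
6-10                    12                       5.5-10.5              8              96    768
11-20                   16                       10.5-20.5             15.5           248   3844
Total                   30                                                            350   4630

Mean:

$$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n} = \frac{350}{30} = 11.7 \text{ minutes}$$

Median (the median class is 11-20, since n/2 = 15 exceeds the cumulative frequency of 14 before it; F_m = 16 and w = 10):

$$\tilde{X} = L_m + \left(\frac{\frac{n}{2} - F_c}{F_m}\right) w = 10.5 + \left(\frac{15 - 14}{16}\right) \times 10 = 11.1 \text{ minutes}$$

Standard deviation:

$$S = \sqrt{\frac{\sum_{i=1}^{K} f_i m_i^2 - n\bar{x}^2}{n-1}} = \sqrt{\frac{4630 - 30 \times (11.67)^2}{29}} \approx 4.34 \text{ minutes}$$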
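The grouped-data calculations for this example can be sketched in Python. The sketch makes two assumptions that should be read as such: each class is represented by the midpoint of its true class interval, with the actual width of the median class (10 for the 11-20 class) used for interpolation, and the standard deviation uses squared deviations from the mean divided by n − 1. Variable names are illustrative.

```python
import math

# (lower true boundary, upper true boundary, frequency) for each class
classes = [(0.5, 5.5, 2), (5.5, 10.5, 12), (10.5, 20.5, 16)]

n = sum(f for _, _, f in classes)
freqs = [f for _, _, f in classes]
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]

# Mean: sum of f * m divided by n
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Sample SD from midpoints (assumption: n - 1 denominator)
squared_dev = sum(f * (m - grouped_mean) ** 2 for f, m in zip(freqs, midpoints))
grouped_sd = math.sqrt(squared_dev / (n - 1))

# Median: interpolate inside the class where the cumulative frequency reaches n/2
cumulative = 0
for lo, hi, f in classes:
    if cumulative + f >= n / 2:
        grouped_median = lo + (n / 2 - cumulative) / f * (hi - lo)
        break
    cumulative += f

print(round(grouped_mean, 1), round(grouped_median, 1), round(grouped_sd, 2))
```

Under these assumptions the sketch gives a mean of about 11.7 minutes, a median of about 11.1 minutes and a standard deviation of about 4.3 minutes.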
Measures of Position (Fractiles)
• In addition to measures of central tendency and dispersion, measures of position give additional information about a given dataset.
• Fractiles (Quantiles) are numbers that partition, or divide, an ordered dataset into equal parts.
• For instance, the median is a fractile because it divides an ordered dataset into two equal parts.
• The commonly used measures of position are Quartiles (that divide the data into 4 parts), Deciles (that divide the data into 10 parts), and Percentiles (that divide the data into 100 parts).
Quartiles
• Quartiles divide a dataset into four equal parts.
• The three quartiles Q1, Q2 and Q3 divide an ordered dataset into four equal parts.
– About ¼ of the data falls on or below the first quartile Q1.
– About ½ of the data falls on or below the second quartile Q2 (equivalent to the median).
– About ¾ of the data falls on or below the third quartile Q3.
– About ¼ of the data falls above the third quartile Q3.
Quartiles Cont…
• In order to identify the quartiles of a given dataset:
• Sort the values in increasing order.
• Identify the quartiles accordingly:
– Q1 is the {0.25(n+1)}th observation
– Q2 is the median, or the {0.5(n+1)}th observation
– Q3 is the {0.75(n+1)}th observation
• NB: if the identified position is not a whole number, the value should be determined by interpolation between the observations on either side.
Quartiles Cont…
• Example: Let’s assume the following dataset presents the ages of 8 factory workers. Identify the first and the third quartiles.
{18, 21, 23, 24, 24, 32, 42, 59}
• First make sure that the data are sorted in increasing order.
• Q1 is the {0.25(n+1)}th observation
= {0.25(8+1)}th observation
= {0.25(9)}th observation
= {2.25}th observation
Quartiles Cont…
• i.e. Q1 lies a quarter of the distance between the 2nd and 3rd observations (21 and 23). This can be interpolated as:
21 + (23 − 21)(0.25) = 21.5
• The interpretation is that one fourth of the observations are below or equal to the value 21.5.
• Q3 is the {0.75(n+1)}th observation
= {6.75}th observation
32 + (42 − 32)(0.75) = 39.5
• The interpretation is that three fourths of the observations are below or equal to the value 39.5.
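The quartiles of this example can be reproduced with `statistics.quantiles` from the Python standard library. Its `method="exclusive"` option places the k-th of n cut points at position k(n+1)/n and interpolates linearly, which corresponds to the (n+1) position rule used in these notes. The variable names are illustrative.

```python
from statistics import quantiles

ages = [18, 21, 23, 24, 24, 32, 42, 59]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(ages, n=4, method="exclusive")
interquartile_range = q3 - q1

print(q1, q2, q3, interquartile_range)
```

Other software may use different interpolation rules, so quartiles reported by statistical packages can differ slightly from hand calculations.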
Quartiles Cont…
Additional uses of the quartiles:
• The interquartile range (Q3 − Q1) can be used as a measure of dispersion (like the range). The interquartile range overcomes one of the limitations of the range (i.e. being affected by extreme values).
• The quartile deviation [(Q3 − Q1)/2] and the coefficient of quartile deviation [(Q3 − Q1)/(Q3 + Q1)] are also, though rarely, used as measures of dispersion.
• A dataset can be summarized using the so-called “five number summary” (this is sometimes represented graphically as a box-and-whisker plot). The five numbers are: the minimum, the first quartile, the median, the third quartile and the maximum.
Deciles
• Deciles serve to partition data into 10 equal parts.
• They are not used as commonly as percentiles and quartiles.
• There are 9 deciles dividing the population into 10 parts.
• The deciles are termed D1 through D9.
• The interpretation of deciles is as follows:
– About one tenth of the data falls on or below D1.
– About two tenths of the data falls on or below D2.
– The same meaning applies to the other deciles.
• Note that D5 has the same meaning as the median or the second quartile.
Deciles Cont…
A given decile is determined in the following manner:
1. Arrange the data in ascending order.
2. Compute the decile using the formula:

$$k^{\text{th}} \text{ decile} = \left[\frac{k}{10}(n+1)\right]^{\text{th}} \text{ observation}$$

3. NB: if the identified position is not a whole number, the value should be determined by interpolation between the observations on either side.
Percentiles
• Percentiles are also like quartiles, but divide the dataset into 100 equal parts.
• Each group represents 1% of the dataset.
• There are 99 percentiles, termed P1 through P99.
• P50 is yet another term for the median.
• Other equivalents, such as P25 = Q1, P75 = Q3, P10 = D1, etc., should also be obvious.
• The interpretation of percentiles is as follows:
– 1% of the data falls on or below P1.
– 2% of the data falls on or below P2.
– The same applies to the other values.
Percentiles Cont…
A given percentile is determined in the following manner:
1. Arrange the data in ascending order.
2. Compute the percentile using the formula:

$$k^{\text{th}} \text{ percentile} = \left[\frac{k}{100}(n+1)\right]^{\text{th}} \text{ observation}$$

3. If the identified position is not a whole number, the value should be determined by interpolation between the observations on either side.
Example
• The following data represent the Biostatistics results of 18 students out of 100 marks. Calculate the 4th decile and the 70th percentile.
{72, 51, 59, 80, 84, 71, 82, 71, 51, 48, 66, 81, 78, 69, 75, 67, 76, 75}
• Computing the 4th decile
• Before starting the computation, arrange the observations in increasing order, i.e.
{48, 51, 51, 59, 66, 67, 69, 71, 71, 72, 75, 75, 76, 78, 80, 81, 82, 84}
Example Cont…
• Compute the 4th decile using the formula:

$$4^{\text{th}} \text{ decile} = \left[\frac{4}{10}(n+1)\right]^{\text{th}} \text{ observation} = \left[(0.4)(19)\right]^{\text{th}} \text{ observation} = 7.6^{\text{th}} \text{ observation}$$

• The 4th decile is between the 7th and 8th observations (i.e. between 69 and 71).
• In order to get the exact value we have to interpolate: 69 + (71 − 69)(0.6) = 70.2
• About four tenths of the data falls on or below 70.2.
Example Cont…
• Compute the 70th percentile
• The data are already sorted.
• Compute the 70th percentile using the formula:

$$70^{\text{th}} \text{ percentile} = \left[\frac{70}{100}(n+1)\right]^{\text{th}} \text{ observation} = 13.3^{\text{th}} \text{ observation}$$

• The 70th percentile is between the 13th and 14th observations (i.e. between 76 and 78).
• In order to get the exact value we have to interpolate: 76 + (78 − 76)(0.3) = 76.6
• About 70% of the data falls on or below the value 76.6.
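The position-and-interpolation rule used for deciles and percentiles can be written as one general function. The name `fractile` and its arguments are hypothetical; clamping to the smallest or largest observation when the position falls outside the data is an assumption of this sketch and is not covered in the notes.

```python
def fractile(values, k, parts):
    """k-th of `parts` fractiles via the [(k/parts)(n+1)]-th observation rule."""
    ordered = sorted(values)
    n = len(ordered)
    # Position = whole + remainder/parts, kept in integers to avoid rounding error.
    whole, remainder = divmod(k * (n + 1), parts)
    if whole < 1:          # position before the first observation (assumed clamp)
        return ordered[0]
    if whole >= n:         # position at or beyond the last observation (assumed clamp)
        return ordered[-1]
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + (upper - lower) * remainder / parts

marks = [72, 51, 59, 80, 84, 71, 82, 71, 51, 48, 66, 81, 78, 69, 75, 67, 76, 75]

d4 = fractile(marks, 4, 10)     # 4th decile
p70 = fractile(marks, 70, 100)  # 70th percentile
print(round(d4, 1), round(p70, 1))
```

Calling `fractile(marks, 1, 4)` or `fractile(marks, 3, 4)` gives quartiles by the same rule, which shows that quartiles, deciles and percentiles are one idea with different numbers of parts.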
Rate, Ratio and Proportion
• In addition to measures of central tendency, measures of dispersion and measures of position, a dataset can be mathematically summarized by the use of Rate, Ratio and Proportion.
Rate
• In mathematics, a rate is a fraction in which the numerator measures one variable and the denominator another.
• Usually the denominator of a rate is a time measure.
• In epidemiology we use rates to measure the occurrence of events over time.
• If the time element is directly reflected in the denominator, it is called a real rate (Example: Incidence density).
• If the fraction measures the number of events per population at risk in a given period of time, it is called an operational rate (Example: Incidence proportion).
Ratio
• Mathematically, a ratio is the comparison of two quantities that have the same units (usually classes of a variable).
• A ratio can be written in three different ways:
– As two numbers separated by a colon (a:b)
– As a fraction (a/b)
– As two numbers separated by the word "to" (a to b)
• In epidemiology, a ratio presents two quantities (as numerator and denominator) where one is not included in the other.
Proportion
• A proportion is usually presented as a fraction, decimal or percentage.
• Unlike a ratio, the numerator is a subset of the denominator; hence the value indicates the overall contribution of the numerator to the denominator.
Numeric Summarization Using SPSS
• In SPSS, numeric summaries are available through several menu paths. Commonly used ones are:
– Analyze > Descriptive Statistics > Frequencies > Statistics.
– Analyze > Descriptive Statistics > Descriptives > Statistics.
– Analyze > Descriptive Statistics > Crosstabs > Statistics.
– Analyze > Descriptive Statistics > Explore > Statistics.
– Analyze > Reports > OLAP Cubes > Statistics.
Basic Probability
What is Probability
• Probability is the chance that an event will occur when the trial is repeated a very large number of times under the same conditions. OR
• The probability of an event is the relative frequency of a set of outcomes over an indefinitely large (or infinite) number of trials.
• A sample space is the set of all possible outcomes of a trial or experiment.
• An event is a subset of the sample space.
• An event can be simple or composite. A composite event contains more than one simple event.
Concept of Union, Intersection and Complement
[Figure: diagrams illustrating the union, intersection and complement of events.]
Mutually Exclusive Events and The Additive Law
• Events are said to be mutually exclusive if they have no outcome in common.
• Examples:
• The Additive Law, when applied to two mutually exclusive events, states that the probability of either of the two events occurring is obtained by adding the probability of each event.
• p(A or B) = p(A) + p(B)
Mutually Exclusive Cont..
Example 4.1:
• Roll a six-sided die. There are six possible outcomes (sample space): 1, 2, 3, 4, 5, 6. Each outcome has an equal probability of occurrence (i.e. 1/6). The probability of rolling an even number would be:
• p(even) = p(2) + p(4) + p(6)
• = (1/6) + (1/6) + (1/6) = 1/2
Mutually Exclusive Cont..
Example 4.2:
• The natural history of Tuberculosis indicates that, for TB patients without any treatment, by the end of the 5th year of illness ½ would die, ¼ would develop permanent disability and ¼ would recover. What is the probability that an untreated TB patient will either recover or develop permanent disability (in other words, avoid death) after 5 years of illness?
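As a worked illustration of the additive law for mutually exclusive events, the sketch below redoes Example 4.1 and works through Example 4.2 with exact fractions. The variable names are illustrative, and the three TB outcomes are assumed to be mutually exclusive and exhaustive, as the example implies.

```python
from fractions import Fraction

# Example 4.1: rolling an even number with a fair six-sided die
p_face = Fraction(1, 6)
p_even = p_face + p_face + p_face  # p(2) + p(4) + p(6)

# Example 4.2: untreated TB patient after 5 years of illness
p_die = Fraction(1, 2)
p_disability = Fraction(1, 4)
p_recover = Fraction(1, 4)
p_avoid_death = p_recover + p_disability  # additive law

print(p_even, p_avoid_death)
```

The same answer for Example 4.2 follows from the complement rule, 1 − p(death), because the three outcomes cover all possibilities.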
Conditional Probability and the Multiplicative Law
• Conditional probability is defined as the probability that a certain event will occur given that another event has occurred.
• p(A|B), or "probability of A given B", is:

$$p(A|B) = \frac{p(A \cap B)}{p(B)}$$

• This formula is conveniently rewritten as the following, which is commonly referred to as the Multiplicative Rule:

$$p(A \cap B) = p(A|B) \times p(B)$$
Conditional Probability Cont..
Example 4.3:
• What is the probability that the outcome of a roll of a die is 2 (A2) given that the outcome is even?
Example 4.4:
• A medical practitioner measured the CD4 count of AIDS patients on ART twice within a month. About 25% of the patients had a normal value in both tests and 42% of them had a normal result in the first test. What percentage of those who had a normal value in the first test also had a normal value in the second test?
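Both examples can be worked through with the conditional probability formula p(A|B) = p(A and B)/p(B). The sketch below is one possible solution; the function and variable names are illustrative.

```python
from fractions import Fraction

def conditional(p_a_and_b, p_b):
    """p(A|B) = p(A and B) / p(B)."""
    return p_a_and_b / p_b

# Example 4.3: p(2 | even). "2 and even" is just the outcome 2.
p_two_given_even = conditional(Fraction(1, 6), Fraction(1, 2))

# Example 4.4: p(normal in 2nd test | normal in 1st test)
p_second_given_first = conditional(0.25, 0.42)

print(p_two_given_even, round(p_second_given_first * 100, 1))
```

So about one third of even rolls are a 2, and roughly 60% of patients with a normal first test also have a normal second test.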
Independent Events and the Multiplicative Law
• For two given events, if the occurrence or nonoccurrence of one doesn’t affect in any way the occurrence or nonoccurrence of the other, the events are called independent events.
• With independent events the multiplicative law becomes:
p(A and B) = p(A)p(B)
Independent Events Cont..
Example 4.5:
• Assume we have rolled a die twice. What is the probability of getting 6 in both rolls?
Example 4.6:
• The probability of getting a normal birth weight baby at 33 weeks of gestational age is 1/5. If two pregnant women at this gestational age gave birth in Bethel Hospital yesterday, what is the probability that both babies have normal birth weight?
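A possible solution to both examples using the multiplicative law for independent events is sketched below. Treating the two births as independent is an assumption of the sketch (it would not hold, for example, for twins); the variable names are illustrative.

```python
from fractions import Fraction

# Example 4.5: a 6 on both of two independent rolls
p_six = Fraction(1, 6)
p_double_six = p_six * p_six

# Example 4.6: both babies have normal birth weight (assumed independent births)
p_normal = Fraction(1, 5)
p_both_normal = p_normal * p_normal

print(p_double_six, p_both_normal)
```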
Bayes' Theorem
• Bayes' theorem was published in the eighteenth century by Thomas Bayes.
• It says that you can use conditional probability to make predictions in reverse.
• It is sometimes called the inverse probability law:
P(B|A) = P(A and B)/P(A) ……………………………… [1]
P(A|B) = P(A and B)/P(B) ……………………………… [2]
• Solving [1] for P(A and B) and substituting into [2] gives Bayes' Theorem:
P(A|B) = [P(B|A)][P(A)]/P(B)
• The general formula for Bayes' Theorem, for mutually exclusive and exhaustive events A_1, ..., A_k, is:

$$P(A_i|B) = \frac{P(B|A_i)\,P(A_i)}{\sum_{j=1}^{k} P(B|A_j)\,P(A_j)}$$
Bayes' Theorem Cont…
Example 4.7:
• Suppose there is a certain disease randomly found in 0.5% (.005) of the general population. A certain clinical blood test is 99% effective in detecting the presence of the disease among persons with the disease. But it also yields false-positive results in 5% of individuals without the disease. The following tables show the probabilities that are stipulated in the example and the probabilities that can be inferred from the stipulated information:
• (Source: http://faculty.vassar.edu/lowry/bayes.html)
Bayes' Theorem Cont…
Given:
P(A) = .005: the probability that the disease will be present in any particular person
P(~A) = 1 − .005 = .995: the probability that the disease will not be present in any particular person
P(B|A) = .99: the probability that the test will yield a positive result [B] if the disease is present [A]
P(~B|A) = 1 − .99 = .01: the probability that the test will yield a negative result [~B] if the disease is present [A]
P(B|~A) = .05: the probability that the test will yield a positive result [B] if the disease is not present [~A]
P(~B|~A) = 1 − .05 = .95: the probability that the test will yield a negative result [~B] if the disease is not present [~A]
Bayes' Theorem Cont…
• Given this information, two simple probabilities can be derived using the conditional probability formula.
P(B) = [P(B|A) x P(A)] + [P(B|~A) x P(~A)] = [.99 x .005] + [.05 x .995] = .0547: the probability of a positive test result [B], irrespective of whether the disease is present [A] or not present [~A]
P(~B) = [P(~B|A) x P(A)] + [P(~B|~A) x P(~A)] = [.01 x .005] + [.95 x .995] = .9453: the probability of a negative test result [~B], irrespective of whether the disease is present [A] or not present [~A]
Bayes' Theorem Cont…
• Then it is possible to calculate the remaining probabilities.
P(A|B) = [P(B|A) x P(A)] / P(B) = [.99 x .005] / .0547 = .0905: the probability that the disease is present [A] if the test result is positive [B]
P(~A|B) = [P(B|~A) x P(~A)] / P(B) = [.05 x .995] / .0547 = .9095: the probability that the disease is not present [~A] if the test result is positive [B]
P(~A|~B) = [P(~B|~A) x P(~A)] / P(~B) = [.95 x .995] / .9453 = .99995: the probability that the disease is absent [~A] if the test result is negative [~B]
P(A|~B) = [P(~B|A) x P(A)] / P(~B) = [.01 x .005] / .9453 = .00005: the probability that the disease is present [A] if the test result is negative [~B]
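The calculation in Example 4.7 can be sketched as a small function. The function name `posterior` and its parameter names are hypothetical; the input values are the ones stipulated in the example.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return (P(disease | positive test), P(positive test)) by Bayes' theorem."""
    # Total probability of a positive test: true positives + false positives.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive, p_positive

p_a_given_b, p_b = posterior(prior=0.005, sensitivity=0.99, false_positive_rate=0.05)
print(round(p_b, 4), round(p_a_given_b, 4))
```

Even with a test that detects 99% of cases, only about 9% of positive results correspond to true disease here, because the disease is rare and false positives dominate.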
Summary of the Basic Properties of Probability
1. The value of a probability can only be 0 ≤ p ≤ 1.
2. If an event is certain to occur, its probability is 1, and if an event is certain not to occur, its probability is 0.
3. If two events are mutually exclusive (disjoint), the probability that one or the other will occur equals the sum of the probabilities: p(A or B) = p(A) + p(B)
4. If A and B are two events, not necessarily disjoint, then p(A or B) = p(A) + p(B) − p(A and B)
5. The sum of the probabilities that an event will occur and that it will not occur is equal to 1.
6. If A and B are two independent events then p(A and B) = p(A)p(B)
7. p(A|B) = p(A ∩ B)/p(B)
Random Variable and Probability Distribution
Random Variable
• Any characteristic that can be measured or categorized is called a Variable.
• If a variable can assume a number of different values so that any particular outcome is determined by chance, it is called a Random Variable.
• A Random Variable is a function which assigns unique numerical values to all possible outcomes of a random experiment under fixed conditions.
Random Variable Cont…
Example 4.8
• Three students are taken at random from this classroom. Suppose our interest is the number of female students among the three sampled. The possible outcomes, with the number of females in each, are:

Outcome   No of Females
MMM       0
MMF       1
MFM       1
FMM       1
MFF       2
FMF       2
FFM       2
FFF       3
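The outcome table of Example 4.8 can be generated by enumerating the sample space; the random variable is then simply a function that maps each outcome to a number. The variable names are illustrative.

```python
from collections import Counter
from itertools import product

# All ordered selections of three students, each Male (M) or Female (F).
outcomes = ["".join(o) for o in product("MF", repeat=3)]

# The random variable: number of females in each outcome.
females = {o: o.count("F") for o in outcomes}

# How many outcomes lead to each value of the random variable.
counts = Counter(females.values())

for outcome, x in females.items():
    print(outcome, x)
print(dict(counts))
```

The counts (1, 3, 3, 1 outcomes for 0, 1, 2, 3 females) become probabilities only once a probability is attached to each outcome.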
Random Variable Cont…
• There are two types of random variables.
– A Continuous Random Variable is one that can take any value within an interval, and therefore has infinitely many possible values; and,
– A Discrete Random Variable is one that takes a finite (or countable) number of distinct values.
• Example 4.9:
– A coin is tossed 10 times. The random variable X is the number of tails that are noted. X can only take the values 0, 1, ..., 10, so X is a Discrete Random Variable.
– A light bulb is burned until it burns out. The random variable Y is its lifetime in hours. Y can take any positive real value, so Y is a Continuous Random Variable.
Probability Distributions
• Every Random Variable has a corresponding Probability Distribution.
• A Probability Distribution applies the theory of probability to describe the behavior of the random variable.
• In the discrete case, it specifies all possible outcomes of the random variable along with the probability that each will occur.
• In the continuous case, it allows us to determine the probabilities associated with specified ranges of values.
Discrete Probability Distribution
• Usually represented by a table.
Example 4.10:
• Table 4.1: Probability distribution of a random variable X representing the birth order of children born in the US.

x       P(X=x)
1       0.416
2       0.330
3       0.158
4       0.058
5       0.021
6       0.009
7       0.004
8+      0.004
Total   1.000
Continuous Probability Distributions
• Since a continuous random variable can assume an infinite number of values, its distribution cannot be expressed in tabular form. Instead, an equation or graph describes it.
• The equation used to describe a continuous probability distribution is called a Probability Density Function (PDF).
• The PDF has the following properties:
Continuous Probability Distributions Cont..
• The area bounded by the curve of the density function and the x-axis is equal to 1, when computed over the domain of the variable.
• The probability that a random variable assumes a value between a and b is equal to the area under the density function bounded by a and b.
• The probability that a continuous random variable will equal a specific value is always zero.
Binomial Distribution
• A discrete probability distribution.
• It handles dichotomous (binary, Bernoulli) random variables, i.e. variables which have only two outcomes (success and failure).
• Each trial is called a Bernoulli trial, and a binomial experiment satisfies the following:
– The experiment consists of n repeated trials.
– Each trial can result in just two possible outcomes.
– The probability of success, denoted by P, is the same on every trial.
– The trials are independent.
Binomial Distribution Cont..
• b(x; n, P): The probability that an n-trial binomial experiment results in exactly x successes, when the probability of success on an individual trial is P.
$$b(x; n, P) = {}^{n}C_{x} \, P^{x} (1 - P)^{n - x}$$
Binomial Distribution Cont..
Example 4.11:
• Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?
• Suppose that in Addis Ababa the probability that a commercial sex worker is HIV positive is 0.15. If we consider 5 randomly selected commercial sex workers in the city, what is the probability that exactly 2 of them will be positive?
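The two questions in Example 4.11 can be checked numerically with the binomial formula. The sketch below is only an illustration; the function name `binomial_pmf` is made up for this example and is not part of the course material.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Die tossed 5 times, exactly 2 fours (P = 1/6 on each toss).
p_two_fours = binomial_pmf(2, 5, 1 / 6)

# 5 randomly selected people, exactly 2 positive (P = 0.15 for each).
p_two_positive = binomial_pmf(2, 5, 0.15)

print(round(p_two_fours, 3), round(p_two_positive, 3))
```

Both answers come from the same formula; only n, x and P change.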
Binomial Distribution Cont..
Cumulative Binomial Probability:
• Refers to the probability that the binomial random variable falls within a specified range (e.g., is greater than or equal to a stated lower limit and less than or equal to a stated upper limit).
Binomial Distribution Cont…
Example 4.12:
• The probability that a student is accepted to a prestigious college is 0.3. If 5 students from the same school apply, what is the probability that at most 2 are accepted?
• What is the probability of getting 4 or more HIV positives among 5 randomly selected sex workers, given that the probability that a commercial sex worker is HIV positive is 0.15?
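A cumulative binomial probability is a sum of individual binomial probabilities over the stated range. The following sketch applies that idea to Example 4.12; the helper names `binomial_pmf` and `binomial_range` are hypothetical.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_range(lo: int, hi: int, n: int, p: float) -> float:
    """P(lo <= X <= hi): add up the probabilities of each count in the range."""
    return sum(binomial_pmf(x, n, p) for x in range(lo, hi + 1))


# At most 2 of 5 students accepted, P = 0.3.
p_at_most_2 = binomial_range(0, 2, 5, 0.3)

# 4 or more positives among 5 people, P = 0.15.
p_4_or_more = binomial_range(4, 5, 5, 0.15)

print(round(p_at_most_2, 4), round(p_4_or_more, 4))
```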
Poisson Distribution
• A discrete probability distribution.
• First introduced by Siméon-Denis Poisson (1781–1840).
• It expresses the probability of a number of random events occurring in a fixed period of time if these events occur with a known average rate.
• A Poisson experiment is a statistical experiment that has the following properties:
Poisson Distribution Cont…
– The experiment results in outcomes that can be classified as successes or failures.
– The average number of successes (λ) that occurs in a specified period is known.
– The probability that a success will occur is proportional to the duration of the time.
– The probability that a success will occur in an extremely small time is virtually zero.
• Note that the distribution can also be used to quantify the probability of occurrence of an event in a length, an area, a volume, etc.
Poisson Distribution Cont…
• The following notations are important:
– e: A constant equal to approximately 2.71828.
– λ: The mean number of successes (occurrences of an event) in a specified period of time.
– x: The actual number of successes that occur in a specified period of time.
– P(x; λ): The Poisson probability that exactly x successes occur in a Poisson experiment, when the mean number of successes is λ.
Poisson Distribution Cont…
• Given the mean number of successes (λ) that occur in a specified period of time, we can compute the Poisson probability based on the following formula:
$$P(x; \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$
Example 4.13:
• Let’s assume the average number of deaths from breast cancer is 2 per day. What is the probability that exactly 3 patients will die tomorrow?
• λ = 2; since 2 patients die per day, on average.
• x = 3; i.e. we want the likelihood that 3 will die tomorrow.
• e = 2.71828.
Poisson Distribution Cont…
• We put these values into the formula as follows:
$$P(x; \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$
$$P(3; 2) = \frac{(2.71828^{-2})(2^{3})}{3!} = \frac{(0.13534)(8)}{6} = 0.180$$
• Thus, the probability of exactly 3 deaths tomorrow is 0.180.
Poisson Distribution Cont…
Example 4.14:
• In a study of suicides, a researcher found that the monthly distribution of adolescent suicides in the US follows a Poisson distribution with parameter λ = 2.75. Find the probability that a randomly selected month will be one in which three adolescent suicides occur.
• $P(x; \lambda) = e^{-\lambda} \lambda^{x} / x!$
• $P(3; 2.75) = e^{-2.75} (2.75)^{3} / 3!$
• $P(3; 2.75) = 0.222$
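The arithmetic in Examples 4.13 and 4.14 can be reproduced with a few lines of code. This is a minimal sketch; `poisson_pmf` and the parameter name `lam` are illustrative choices, not notation from the slides.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """Poisson probability of exactly x events when the mean count is lam."""
    return exp(-lam) * lam**x / factorial(x)


p_example_413 = poisson_pmf(3, 2)      # mean of 2 deaths per day
p_example_414 = poisson_pmf(3, 2.75)   # mean of 2.75 suicides per month

print(round(p_example_413, 3), round(p_example_414, 3))
```

Summing `poisson_pmf` over several values of x gives cumulative probabilities such as "less than 2 events".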
Poisson Distribution Cont…
• If the number of admissions in a hospital is 10 per hour on average, determine the probability that, in any hour, there will be:
– 0 admissions;
– 6 admissions;
– Less than 2 admissions.
Normal Distribution
• Is the most important probability distribution function.
• It is also known as the Gaussian Distribution.
• Named after Carl Friedrich Gauss (1777–1855).
• Given by the formula:
$$Y = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}$$
• The formula is affected by two main factors: the mean (μ) and the SD (σ).
Normal Distribution Cont…
Normal distribution has the following characteristics:
1. Bell shaped
2. Symmetrical about the mean
3. Unimodal
4. Mean, median and mode are equal
5. Area under the curve is 1
6. Extends from negative infinity to positive infinity
• The normal distribution can be used to describe, at least approximately, any variable that tends to cluster around the mean (mainly as a result of the central limit theorem).
Skewness, Kurtosis, and Normal Curve
• Skewness and kurtosis are used to assess normality.
• Significant skewness and kurtosis indicate that the data are not normal.
• Skewness is a measure of asymmetry.
• For univariate data $Y_1, Y_2, \ldots, Y_N$, the formula for skewness is:
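A common moment-based form, written with the symbols defined on this slide (dividing by N is an assumption; some texts divide by N − 1):

$$\text{skewness} = \frac{\frac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{Y})^{3}}{S^{3}}$$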
• Where $\bar{Y}$ is the mean, S is the standard deviation, and N is the number of data points.
• The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero.
Skewness, Kurtosis Cont…
• Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution.
• For univariate data $Y_1, Y_2, \ldots, Y_N$, the formula for kurtosis is:
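A common moment-based form, using the same symbols as for skewness (dividing by N is an assumption; some texts divide by N − 1):

$$\text{kurtosis} = \frac{\frac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{Y})^{4}}{S^{4}}$$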
• The kurtosis for a normal distribution is three.
• For this reason, some use the following definition of kurtosis (often referred to as "excess kurtosis"):
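A common moment-based form subtracts the normal value of three, so that a normal distribution scores zero (dividing by N is an assumption; some texts divide by N − 1):

$$\text{excess kurtosis} = \frac{\frac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{Y})^{4}}{S^{4}} - 3$$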
• Under this definition, positive kurtosis indicates a "peaked" distribution and negative kurtosis indicates a "flat" distribution.
Normality Test
• Normality tests assess the likelihood that a given data set comes from a normal distribution.
• This is an important aspect of statistics, as many procedures assume normality.
• Typically the null hypothesis H0 is that the observations are distributed normally with unspecified mean μ and variance σ².
• The alternative Ha is that the distribution is arbitrary.
• A great number of tests (over 40) have been devised for this problem; the more prominent of them are outlined below:
Normality Test Cont…
• The simplest method of assessing normality is to look at the frequency distribution histogram (symmetry, peakedness of the curve, modality of the distribution).
• The other option is the use of probability plots.
• A probability plot is a graphical technique for comparing two datasets: two sets of empirical observations, one empirical set against a theoretical set, or two theoretical sets against each other.
– It is a common way of assessing normality, i.e. by comparing the given data against a normal distribution.
– It has two variants: the Q-Q plot and the P-P plot.
Normality Test Cont…
• Quantile-Quantile Plot (Q-Q plot):
– Compares two probability distributions by plotting their quantiles against each other.
– If the two distributions being compared are similar, the points in the Q-Q plot will approximately lie on the line y = x.
• Probability-Probability plot (P-P plot):
– Compares two probability distributions by plotting their cumulative distribution functions against each other.
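To make the Q-Q idea concrete, the sketch below pairs each ordered observation with the quantile a fitted normal distribution would give at the same position. The data values, the function name `qq_points` and the plotting-position rule (i − 0.5)/n are all assumptions chosen for illustration; statistical packages may use slightly different rules.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical glucose readings (mg/dl), for illustration only.
glucose = [82, 85, 88, 89, 90, 91, 92, 95, 98, 101]
points = qq_points(glucose)

for theoretical, observed in points:
    print(f"{theoretical:7.2f} {observed:7.2f}")
```

If the data are close to normal, the two printed columns are close to each other, which is the same as the plotted points lying near the line y = x.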
Normality Test Cont…
• It is possible to assess the normality of data objectively using statistical tests (examples: Kolmogorov-Smirnov test, Shapiro-Wilk test).
• In SPSS:
– Analyze > Descriptive Statistics > Explore > enter the variable under Dependent List > open Plots and check “Normality plots with tests” > Continue > OK.
• But such tests have serious limitations:
– Small samples almost always pass a normality test,
– With large samples, minor deviations from normality may be flagged as statistically significant.
Normal Distribution Cont…
Application of the normal distribution to calculate probability:
1. Area under the curve is 1,
2. Probability of x > a is the area between a and positive infinity,
3. Probability of x < a is the area between negative infinity and a,
4. Probability of a < x < b is the area between a and b,
5. Probability of x = a is zero,
6. The empirical rule: about 68%, 95% and 99.7% of values lie within 1, 2 and 3 SD of the mean, respectively.
But how can we compute the area???
Standard Normal Distribution
• Is a normal distribution with a mean of 0 and a standard deviation of 1.
• Any point (x) from a normal distribution can be converted to the standard normal distribution (Z) with the formula:
Z = (x - mean) / standard deviation
• The corresponding area can be read from a standard normal table.
Standard Normal Distribution Cont..
Example 4.15:
• Suppose 1.4 m is the height of a student, where the mean for students of his age and sex is 1.2 m with a standard deviation of 0.4 m.
– What is the corresponding Z value for the student?
– What is the probability of finding a student taller than 1.4 m?
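Example 4.15 can be worked through in code instead of a printed table. The sketch below uses Python's standard-library `NormalDist` in place of the standard normal table; the function name `z_score` is an illustrative choice.

```python
from statistics import NormalDist


def z_score(x: float, mean: float, sd: float) -> float:
    """Convert a value from a normal distribution to a standard normal Z."""
    return (x - mean) / sd


z = z_score(1.4, 1.2, 0.4)

# P(height > 1.4 m) is the area to the right of z under the standard normal curve.
p_taller = 1 - NormalDist().cdf(z)

print(round(z, 2), round(p_taller, 4))
```

The same two steps (convert to Z, then read the area) apply to any "more than", "less than" or "between" question; for a range, subtract the area below the lower Z from the area below the upper Z.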
Standard Normal Distribution Cont..
Example 4.16:
• Assume blood glucose level among medical students is normally distributed with a mean of 90 mg/dl and an SD of 6 mg/dl. Student X has a mean glucose level of 100 mg/dl. Another student Y has a mean glucose level of 80 mg/dl.
– What is the Z score for student X?
– What is the Z score for student Y?
– What is the probability of getting a mean glucose level less than 100 mg/dl?
– What is the probability of getting a mean glucose level less than 80 mg/dl?
Standard Normal Distribution Cont..
– What range around the mean encompasses 68% of the observations?
– What is the probability for a student to have a blood glucose level between 100 and 105 mg/dl?
Standard Normal Distribution Cont..
Example 4.17:
• Among pregnant women having ANC follow-up in a hospital, WBC count follows a normal distribution with a mean of 8,000 and a standard deviation of 800.
– What is the probability of getting a WBC count of more than 10,000 in those pregnant women?
– What is the probability of getting a WBC count between 7,500 and 10,000?
Standard Normal Distribution Cont..
1. Suppose in BL Hospital the probability that a donated blood sample is positive for Hepatitis B is 0.2. If we consider 4 randomly selected donated blood samples, what is the probability that exactly 2 of the samples will be positive for Hepatitis B?
2. Suppose that systolic blood pressures follow a normal distribution with a mean of 108 and an SD of 14. Based on this information, attempt the following questions.
– About 95% of the blood pressures are between ____ & ____.
– About ______% of the blood pressures are between 66 & 150.
– What is the probability that a patient’s BP is > 120?
– What is the probability that the patient’s BP is between 110 & 130?
– What is the probability that a patient’s BP is < 108?
Introduction to Demographic Methods and Health Service Statistics
What is Demography?
• “Demos” + “graphy”
• Is a discipline that studies human populations with respect to size, composition, distribution and mobility, the variation in these features, the causes of such variation, and its effect on health, environmental, social, ethical and economic conditions.
• Demography as a “method” and as “data”.
• Demography studies a population in its “static” and “dynamic” aspects.
• Static aspects include characteristics at a point in time, such as composition by age, sex, race, marital status, etc.
• Dynamic aspects are fertility, mortality, nuptiality, migration and growth.
Source of Demographic Data
• Demographic data can be acquired through three methods:
– Census
– Survey
– Vital Registration
Census
• Refers to the total process of collecting, compiling, analyzing, and publishing or otherwise disseminating demographic, economic, and social data pertaining to all persons in a country or in a well-delineated part of a country at a specified time.
• A census has the following characteristics:
– Universality
– Simultaneity
– Individual enumeration
– Regular interval
Census Cont..
• The first real census was conducted in the UK in 1841.
• However, there is evidence of large-scale counting of populations starting from the prehistoric period.
• Content of Census
– Demographic data
– Economic data
– Social data
– Mortality and Birth
Approaches to Census
De jure:
• The enumeration is according to the legal or customary place of residence.
• i.e. people are registered where they usually reside.
• This type of counting gives information relatively unaffected by seasonal and temporary movements.
• However, it might not be accurate when a person’s legal or customary residence is not known.
• It also creates a risk of omission and double counting.
• Information collected about a person who is away from his/her usual residence can also be incomplete.
Approaches Cont…
De facto:
• The enumeration is according to physical residence at the time of the census.
• i.e. people are registered where they are staying at the time of the census.
• This method is advantageous in the sense that it has less chance of double counting or omission.
• However, if it is applied in areas where there is a high level of migration and mobility, the result can be distorted.
Advantage and Disadvantage of Census
• Advantage
– It represents the whole population,
– Serves as sampling frame for further studies,
– Provides population denominators,
– Provides small area data.
• Disadvantage
– Size limits content and quality control efforts,
– Cost limits frequency,
– Delay between field work and results,
– Sometimes politicized.
Vital Registration (Civil Registration)
• Vital Registration is the continuous registration of vital events as they happen.
• What are the vital events?
• Vital Registration is a relatively modern concept in its present format.
• The major purpose of vital registration is primarily administrative.
• Vital Registration has the following features:
– Continuity
– Universality
Advantages of Vital Registration
• Continuously monitors vital rates,
• May provide both numerator and denominator for some rates,
• Small area data available,
• Can be used as a base for testing the accuracy of censuses and surveys,
• Once a system is established, it would be cost effective.
Disadvantages of Vital Registration
• Uncertain coverage,
• It is difficult to establish the system,
• Information may come from a third party,
• It can easily be disrupted by political/economic events.
Survey
• Refers to the process of obtaining information from a sample representative of some population at a given point in time.
• How can we make it representative?
• Surveys can be of two types:
– Single-round retrospective survey
– Multi-round follow-up survey
• The content of surveys varies widely.
• Features of Survey:
– Representativeness,
– Smaller size,
– More in-depth information.
Advantage and Disadvantage of Survey
• Advantages:
– Quick and inexpensive,
– Gives detailed data,
– Follow up can be achieved.
• Limitations:
– Small area data might not be available,
– Perfect representativeness is difficult to achieve,
– A survey can only be focused on a few thematic areas.
Demographic Transition
• A conceptual framework to explain population change over time.
• Developed by the American demographer Warren Thompson in 1929.
• Based on observed changes in birth and death rates in industrialized societies over the past two hundred years.
• Demographic change has three stages.
• Developed countries started the second stage at the beginning of the eighteenth century. Less developed countries began the transition later.
Demographic Transition Cont…
• Stage I: Characterized by high and fluctuating mortality, high fertility and low population growth.
• Stage II: Characterized by the beginning of mortality decline followed by fertility decline. This is the period of rapid population growth.
• Stage III: Characterized by low mortality and low and fluctuating fertility; growth slows down and eventually reaches a no-growth stage.
Important Indicators of Composition of a Population
1. Sex Ratio: The total number of males per 1000 females. It can be expressed as Y to 1000, Y:1 or Y/X, where Y is the number of males and X is the number of females.
2. Child to Women Ratio: The ratio of the number of children under five to the number of women of reproductive age in a given place and time. It can also be used as a measure of fertility.
3. Dependency Ratio: The ratio between the non-productive (age 0-14 and 65+) and productive (age 15-64) age groups in a given place and time.
4. Population Pyramid:
Population Pyramid
• A graphical illustration that shows the distribution of various age groups in a population.
• Normally forms the shape of a pyramid.
• Consists of two back-to-back bar graphs, with the population plotted on the X-axis and age on the Y-axis,
• One showing the number of males and one showing females in a particular population in five-year age groups.
• Males are shown on the left and females on the right.
Vital Statistics
• Among the areas that demography focuses on, some are more important and applicable in public health.
• In particular, the measures of mortality and fertility are vital inputs to the health system, so they are called Vital Statistics.
Measures of Fertility
• Crude Birth Rate (CBR): The number of live births in a year per 1000 mid-year population in the same year.
$$\text{CBR} = \frac{\text{Total number of live births in a year}}{\text{Mid-year population in the same year}} \times 1000$$
Measures of Fertility Cont..
• General Fertility Rate (GFR): The number of live births in a year per 1000 mid-year women of reproductive age.
$$\text{GFR} = \frac{\text{Total number of live births in a year}}{\text{Mid-year female population aged 15–49 yrs in the same year}} \times 1000$$
Measures of Fertility Cont..
• Age Specific Fertility Rate (ASFR): Refers to the number of live births in a year per 1000 women of reproductive age in a given age or age group.
• Usually ASFR is calculated for the following 7 five-year age groups: 15-19 yr, 20-24 yr, 25-29 yr, 30-34 yr, 35-39 yr, 40-44 yr, 45-49 yr.
$$\text{ASFR} = \frac{\text{Total no. of live births to women of a given age group during a year}}{\text{Mid-year female population for the same age group in the same year}} \times 1000$$
Measures of Fertility Cont..
Age category    ASFR
15-19           104
20-24           228
25-29           241
30-34           231
35-39           160
40-44           84
45-49           34
Measures of Fertility Cont..
• Total Fertility Rate (TFR): The number of children a woman is expected to have by the end of her reproductive age, given that the current ASFRs are maintained.
• Mathematically, it is the sum of all ASFRs from 15-49 yrs.
• TFR for data given in the usual 5-year age categories is given by:
$$\text{TFR} = 5 \times \sum_{i=1}^{7} \text{ASFR}_i$$
Measures of Fertility Cont..
• Gross Reproduction Rate (GRR): Is the total fertility rate restricted to female births only.
$$\text{GRR} = \text{TFR} \times \text{Proportion of female births} \times 1000$$
Measures of Fertility Cont..
• Child Ever Born (CEB):
• Total number of children a woman has ever given birth to.
• It is the average number of children a woman has in a given study area.
Measures of Fertility Cont..
Example 5.1:
• Calculate ASFR, TFR, GFR, CBR from the following data.
Measures of Fertility Cont..
Age category    Women of reproductive age    Live births    ASFR
15-19           15,600                       1,596
20-24           14,400                       3,300
25-29           13,300                       3,210
30-34           12,200                       2,830
35-39           11,600                       1,860
40-44           10,100                       850
45-49           9,200                        320
Total           86,400                       13,966
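The fertility measures for Example 5.1 follow directly from the table. The sketch below is one way to lay out the arithmetic; the variable names are illustrative. CBR is left out because the table as shown does not list the total mid-year population it would need.

```python
# (age group, mid-year women, live births) copied from the Example 5.1 table
data = [
    ("15-19", 15600, 1596),
    ("20-24", 14400, 3300),
    ("25-29", 13300, 3210),
    ("30-34", 12200, 2830),
    ("35-39", 11600, 1860),
    ("40-44", 10100, 850),
    ("45-49", 9200, 320),
]

# ASFR: live births per 1000 women in each age group
asfr = {age: births / women * 1000 for age, women, births in data}

total_women = sum(women for _, women, _ in data)
total_births = sum(births for _, _, births in data)

# GFR: live births per 1000 women aged 15-49
gfr = total_births / total_women * 1000

# TFR: 5 x sum of the seven ASFRs (per 1000 women), then per woman
tfr_per_1000 = 5 * sum(asfr.values())
tfr_per_woman = tfr_per_1000 / 1000

for age, rate in asfr.items():
    print(f"{age}: {rate:.1f}")
print(f"GFR = {gfr:.1f} per 1000, TFR = {tfr_per_woman:.2f} children per woman")
```

Dividing the summed rate by 1000 expresses the TFR per woman rather than per 1000 women.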
Measures of Mortality
• Crude Death Rate (CDR): Refers to the total number of deaths in a given area, usually in a year, per 1000 mid-year population.
$$\text{CDR} = \frac{\text{Total number of deaths per year}}{\text{Mid-year population}} \times 1000$$
Measures of Mortality
• Age Specific Death Rate (ASDR): Quantifies deaths occurring in a defined age category in a given area per 1000 mid-year population of the same age category.
$$\text{ASDR} = \frac{\text{No. of deaths in a given age category in a year}}{\text{Mid-year population of that age category in the same year}} \times 1000$$
Measures of Mortality
• Neonatal Mortality Rate (NMR): The number of deaths before the age of 28 days (neonatal period) in a year out of 1000 live births in the same year.
• Infant Mortality Rate (IMR): The number of deaths before the age of 1 year (infancy period) in a year out of 1000 live births in the same year.
• Under Five Mortality Rate (U5MR): Quantifies the probability of dying between birth and age five per 1000 live births in a given year.
• Child Death Rate (ChDR): Quantifies the probability of dying between the ages of one and five years per 1000 live births in a given year.
Measures of Mortality
• Cause Specific Mortality Rate (CSMR):
$$\text{CSMR} = \frac{\text{No. of deaths secondary to a given cause in a year}}{\text{Population at risk}} \times 1000$$
• Cause Specific Death Ratio (Proportionate Mortality Ratio):
$$\text{Proportionate Mortality Ratio} = \frac{\text{No. of deaths secondary to a cause in a year}}{\text{Total no. of deaths in the same year}} \times 1000$$
Measures of Mortality
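• Maternal Mortality Ratio:
$$\text{Maternal Mortality Ratio} = \frac{\text{Number of maternal deaths in a given year}}{\text{Total number of live births in the same year}} \times 100000$$
• Maternal Mortality Rate:
$$\text{Maternal Mortality Rate} = \frac{\text{Number of maternal deaths in a given year}}{\text{Total number of women of reproductive age in the same year}} \times 100000$$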
Measures of Migration
• Crude In-Migration Rate: Number of in-migrants (I) per 1,000 population in a given year.
• Crude Out-Migration Rate: Number of out-migrants (O) per 1,000 population in a given year.
• Crude Net Migration Rate: Difference between the number of in-migrants (I) and the number of out-migrants (O) per 1,000 population in a given year.
Measures of Marriage
• Crude Marriage Rate: Number of marriages (M) per 1000 population in a given year.
• General Marriage Rate: Number of marriages (M) per 1000 population aged 15 and older in a given year.
Measure of Population Growth and Projection
• Crude Rate of Natural Increase (r):
$$r = \text{CBR} - \text{CDR}$$
• Population Projection:
$$P_t = P_0 (1 + r)^{t}$$
• Population Doubling Time:
$$t = \frac{\log 2}{\log(1 + r)}$$
Health Service Statistics
• Data generated from the health system itself.
• Advantages:
– Gives morbidity information.
– Identifies priority health problems in the area.
– Determines met and unmet health needs.
– Determines success or failure of specific health care programs.
– Assesses utilization of health services.
Health Service Statistics Cont..
• Limitations
– Lack of completeness
– Lack of representativeness to the general community
– Lack of denominators
– Lack of uniformity
– Lack of quality
– Lack of compliance with reporting
Health Service Statistics Cont..
1. Relative Frequency of a Disease:
$$\text{Relative Frequency of a given disease} = \frac{\text{No. of patients diagnosed with a specific disease}}{\text{Total number of health institution visits}} \times 100\%$$
2. Cure Rate:
• Quantifies the proportion of patients who have been cured of a disease condition using a treatment modality, out of 100 patients who received the same type of treatment.
• The term “Success Rate” can be used if the measured parameter is a procedure.
$$\text{Cure Rate} = \frac{\text{No. of cured patients of a given disease using a treatment modality}}{\text{Number of patients who received the treatment}} \times 100\%$$
Health Service Statistics Cont..
3. Admission Rate:
• Quantifies the proportion of admissions among patients who visited the health institution in a given period of time.
$$\text{Admission Rate} = \frac{\text{No. of patients admitted to a health institution}}{\text{Total number of patients who visited the institution}} \times 100\%$$
4. Hospital Death Rate:
• Quantifies the proportion of deaths among hospitalized patients in a given period of time.
$$\text{Hospital Death Rate} = \frac{\text{No. of deaths among hospitalized patients}}{\text{Total no. of admissions}} \times 100\%$$
Health Service Statistics Cont..
5. Bed Occupancy Rate:
• Quantifies the percentage occupancy of hospital beds in a year.
$$\text{BOR} = \frac{\text{Annual number of hospitalized patient days}}{365 \times \text{total number of beds}} \times 100\%$$
6. Average Length of Stay:
• Quantifies the average duration (in days) of stay of hospitalized patients.
$$\text{ALS} = \frac{\text{Annual number of hospitalized patient days}}{\text{Number of discharges or deaths}}$$
Sampling Method
Why Sampling?
• Sampling is that part of statistical practice concerned with the selection of individual observations intended to yield reasonable knowledge about a population of concern, especially for the purposes of statistical inference.
• Study population vs. Target (Source) (Reference) Population.
• Parameter: A descriptive measure computed from the data of the source population.
• Statistic: A descriptive measure computed from the data of a sample.
• The issues of adequate sample size and a representative sampling technique are important for correct estimation of the parameter using a statistic.
Why Sampling?
• Researchers rarely survey the entire population, for two reasons:
(1) The cost is too high, and
(2) The population is dynamic.
• Main advantages of sampling:
(1) The cost is lower,
(2) Data collection is faster, and
(3) It is possible to ensure accuracy and quality of the data because the dataset is smaller.
• Main disadvantage of sampling:
– Non-representativeness (sampling error)
Sampling
Important terms:
• Sampling Unit: The unit of selection in the sampling process.
• Study Unit: The unit on which information is collected.
• Sampling Frame: The list of all the units in the source population from which a sample is to be taken.
• Sampling Fraction: The ratio of the number of units in the sample to the number of units in the source population (n/N). Its reciprocal (N/n) is the Sampling Interval.
Types of Sampling
• Probability Sampling: Every unit in the population has a known, non-zero probability of being sampled, and the process involves random selection.
• Nonprobability Sampling: Any sampling method where some elements of the population have no chance of selection, or where the probability of selection can't be accurately determined.
Probability Sampling
– Simple Random Sampling (SRS)
– Systematic Random Sampling
– Stratified Sampling
– Cluster Sampling
– Multistage Sampling
A. Simple Random Sampling (SRS)
• Is the purest (the most representative) form.
• Each member of the population has an equal, nonzero and known chance of being selected.
• This could be accomplished by writing each study unit's name on a slip of paper and selecting an adequate number of them using the Lottery Method.
• It can also be done by assigning a number to each sampling unit; samples are then selected using a Table of Random Numbers or computer packages.
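As an illustration of the computer-package route, the sketch below draws a simple random sample from a numbered sampling frame. The frame of 500 units, the sample size of 20 and the function name are hypothetical.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, each equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(frame, n)


# Hypothetical sampling frame: units numbered 1 to 500.
frame = list(range(1, 501))
sample = simple_random_sample(frame, 20, seed=1)

print(sorted(sample))
```

Sampling is without replacement, so no unit can be chosen twice, which mirrors the lottery method.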
How to use a table of random numbers
1. Number each member of the population.
2. Determine population size (N).
3. Determine sample size (n).
4. Determine the starting point in the table by randomly picking a page and dropping your finger on the page with your eyes closed.
5. Choose a direction to read (to the left, right, down or up).
6. Select the first n numbers read from the table whose last digits are between 0 and N.
7. Once a number is chosen, do not use it again.
8. If you reach the end of the table before obtaining your n numbers, pick another starting point, read in a different direction, and continue until done.
Simple Random Sampling Cont…
• When a large dataset is available in a database, statistical packages can randomly select a sample of a given size.
• In SPSS:
– Data > Select Cases > Random sample of cases > complete the dialogue box accordingly.
• In Excel:
– Tools > Data Analysis > Sampling > complete the dialogue box accordingly.
Simple Random Sampling Cont…
Limitation of SRS
• Requires a sampling frame,
• Takes a longer time.
B. Systematic Random Sampling
• Selects units at a fixed interval throughout the sampling frame after a random start.
• The steps are:
– Number the units in the population from 1 to N,
– Decide on the n (sample size) that you need,
– Calculate the sampling interval k (k = N/n),
– Randomly select an integer between 1 and k,
– Then take every kth unit.
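The steps above can be sketched as follows. The population size, sample size and function name are hypothetical, and rounding k down to a whole number when N is not an exact multiple of n is an assumption of this sketch.

```python
import random


def systematic_sample(N, n, seed=None):
    """Return the unit numbers picked by systematic random sampling."""
    k = N // n                      # interval k = N / n, rounded down
    rng = random.Random(seed)
    start = rng.randint(1, k)       # random start between 1 and k
    return [start + i * k for i in range(n)]  # then every kth unit


# Hypothetical frame of 500 units and a sample of 20.
selected = systematic_sample(500, 20, seed=7)

print(selected)
```

Only the starting point is random; every later unit is fixed by the interval, which is why a cyclic ordering of the frame can bias the sample.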
Systematic Random Sampling Cont...
• Advantage:
– It is easier and less time consuming to perform.
– Occasionally, it can be conducted without a sampling frame.
• Disadvantage:
– Can be biased when there is a cyclic pattern in the order of the subjects.
C. Stratified Sampling
• Applied when the source population is heterogeneous on a variable of interest.
• The population is first divided into classes (strata).
• Then a separate sample is taken from each stratum using Simple or Systematic Random Sampling techniques.
• The number taken from each stratum might be equal (Non Proportional Stratified Sampling), or the number is determined based on the proportion of each class in the source population (Proportional Stratified Sampling).
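For Proportional Stratified Sampling, the sample size for each stratum is its share of the source population times the total sample size. The sketch below shows that allocation step only; the strata, their sizes and the function name are hypothetical.

```python
def proportional_allocation(strata_sizes, n):
    """Split a total sample size n across strata in proportion to their size."""
    total = sum(strata_sizes.values())
    # Rounding can make the parts add up to slightly more or less than n.
    return {name: round(n * size / total) for name, size in strata_sizes.items()}


# Hypothetical source population with two strata.
strata = {"urban": 6000, "rural": 4000}
allocation = proportional_allocation(strata, 200)

print(allocation)
```

Within each stratum, the allocated number of units would then be drawn by simple or systematic random sampling.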
Stratified Sampling Cont…
• Advantage: Improves representativeness of the sample (Proportional Stratified Sampling) or creates reasonable comparison among strata (Non Proportional Stratified Sampling).
• Limitation: Requires a separate sampling frame for each stratum.
D. Cluster Sampling
• Is a sampling method applied when the source population is composed of “natural” groups.
• Assuming the groups are similar to one another (homogeneous among clusters), cluster sampling selects a few groups (clusters) from the population as the Primary Sampling Units (PSU).
• Then the required information is collected from all elements, the Secondary Sampling Units (SSU), within each selected group.
Cluster Sampling Cont..
• Advantage:
– It doesn’t require the sampling frame of the SSU.
– Requires less time and resources.
• Disadvantage:
– Relies on the assumption of homogeneity among clusters.
– Less control over sample size.
E. Multistage Sampling
• Is like cluster sampling, but involves selecting a sample within each chosen cluster, rather than including all units in the cluster.
• Thus, multi-stage sampling involves selecting a sample in at least two stages.
• The advantage is that it is simpler than SRS.
• The disadvantage is that sampling error inflates as the number of stages increases.
Probability Proportional to Size Sampling Technique
• PPS is a variant of the cluster sampling technique.
• Useful when the sampling units vary considerably in size.
• The probability of selecting a sampling unit (e.g., village, zone, district, health center) is proportional to the size of its population.
Involves the following procedures:
• List all clusters with their respective source population size and cumulative frequency.
• Decide the number of clusters (a) which will be included in the study.
PPS Cont…
• Decide the number of individuals (b) which will be studied per one selection of a cluster.
• Divide the total population by the number of clusters to be studied. This gives the sampling interval (SI).
• Choose a number between 1 and the SI at random. This is the Random Start (RS) point.
• Calculate the following series: RS; RS + SI; RS + 2SI; ...; RS + (a-1)SI.
• Based on the cumulative frequency, identify the clusters in which the selected numbers fall.
• For every selection of a cluster, select b individuals at random from it. Note that if a cluster is selected twice, 2b individuals should be selected at random.
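The PPS procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the village names and population sizes are hypothetical, and the function name `pps_select` is made up for this sketch.

```python
import random

def pps_select(cluster_sizes, a, rng):
    """Systematic PPS selection.

    cluster_sizes: list of (cluster name, population size) pairs
    a: number of cluster selections
    Returns {cluster name: number of times the cluster was selected}.
    """
    total = sum(size for _, size in cluster_sizes)
    si = total // a                              # sampling interval (SI)
    rs = rng.randint(1, si)                      # random start (RS), 1..SI
    targets = [rs + k * si for k in range(a)]    # RS, RS+SI, ..., RS+(a-1)SI

    hits = {}
    cumulative = 0
    idx = 0
    for name, size in cluster_sizes:
        cumulative += size                       # cumulative frequency
        # every target number that falls in this cluster selects it once
        while idx < len(targets) and targets[idx] <= cumulative:
            hits[name] = hits.get(name, 0) + 1
            idx += 1
    return hits

# Hypothetical villages and population sizes (not from the lecture)
villages = [("A", 1200), ("B", 300), ("C", 4500), ("D", 800), ("E", 2200)]
rng = random.Random(42)
hits = pps_select(villages, a=3, rng=rng)
b = 20  # individuals per selection (hypothetical)
for name, times in hits.items():
    print(name, "selected", times, "time(s): sample", times * b, "individuals")
```

Because village C is larger than the sampling interval, it is always selected at least once, which is exactly the "proportional to size" behaviour.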
247
2. Nonprobability Sampling
• Here, the sample is less likely to be representative of the population, thus it is difficult to extrapolate from the sample to the population.
• Is used when there is no sampling frame or when it is impossible to conduct probability sampling due to economic and feasibility factors.
248
Nonprobability Sampling Cont..
• Judgmental or Purposive Sampling: The researcher chooses the sample based on who he/she thinks would be appropriate for the study.
• Convenience Sampling: The selection of units from the population is based on availability and/or accessibility.
• Quota Sampling: It starts with systematically setting a “quota” to represent subgroups of a population. Then data are collected to meet the predefined quota.
• Snowball Sampling: The researcher begins by identifying someone who meets the inclusion criteria of the study. Then the study subject is asked to recommend others he/she knows who also meet the criteria.
249
Sampling Error
• Sampling error or estimation error is the part of the total error or uncertainty caused by observing a sample instead of the whole population.
• Non-sampling errors such as non-response and reporting errors may also affect the outcome of a sample-based study.
• Theoretically, sampling error is the value estimated from a sample minus the population value.
• Unlike bias, sampling error can be predicted, calculated, and accounted for.
• There are several measures of sampling error.
250
Sampling Error Cont…
1. Standard error
• Is a measure of the variability of an estimate due to sampling.
• It indicates the extent to which an estimate derived from a sample survey can be expected to deviate from the population value.
• Depends upon the underlying variability in the population for the characteristic as well as the sample size used for the survey.
• The standard error is a foundational measure from which other sampling error measures are derived.
251
Sampling Error Cont…
2. Confidence intervals:
• A range that is expected to contain the population value of the characteristic with a known probability.
3. Margin of error:
• Is a measure of the precision of an estimate at a given level of confidence.
4. Coefficient of variation:
• The relative amount of sampling error in comparison with a sample estimate.
• CV = (SE / Estimate) × 100%
• No hard and fast rules to define an acceptable level.
• The smaller the CV, the more reliable the estimate.
252
Sampling Error Cont…
5. P values:
• The P value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.
Importance of such measures:
• To indicate the statistical reliability and usability of estimates.
• To make comparisons between estimates.
• To conduct tests of statistical significance.
• To help users draw appropriate conclusions about data.
253
Exercise 1
• A medical practitioner wanted to assess the quality of family planning service offered in a hospital. Accordingly, he conducted exit interviews with those women whose ID number was a multiple of five. What sampling method is employed?
254
Exercise 2
• A medical practitioner wanted to assess the prevalence of malnutrition among under five children in a woreda. Assuming all kebeles in the woreda are similar, he included all under five children in two randomly selected kebeles.
 – What sampling method is employed?
 – What possible limitation do you expect?
255
Exercise 3
• A medical practitioner wanted to assess the prevalence of malnutrition among under five children in a woreda. Assuming the problem is different across the three agro-ecological zones in the woreda, he included children from 2 kebeles each from Kolla, Dega and Woynadega.
 – What sampling method is employed?
 – What possible limitation do you expect?
256
3/1/2010
8/3/2019 Biostat Lecture Note
http://slidepdf.com/reader/full/biostat-lecture-note 65/116
65
Exercise 4
• A researcher wanted to study the prevalence of drug addiction among adolescents in Addis Ababa. First he randomly selected Bole sub city. Then he selected woreda 17 at random from all woredas in Bole sub city. Finally he conducted his study in Kebele 19 (after random selection).
 – What sampling method is employed?
 – What possible limitation do you expect?
 – If woreda 17 was selected because of its proximity to the organization of the researcher, what would have been the sampling method?
257
Sampling Distribution and Estimation
Estimation
• Estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample.
• Can be of two types:
 – Point Estimation
 – Interval Estimation
259
Point Estimate
• Point Estimate: A point estimate of a population parameter is a single value of a statistic.
• The following table gives commonly used point estimators.
260
Interval Estimate
• An interval estimate is defined by two numbers, between which a population parameter is said to lie.
• For example, $a < \mu < b$ is an interval estimate of the population mean $\mu$.
• i.e. the population mean is greater than a but less than b.
• An interval estimate has got three components (concepts).
261
Interval Estimate Cont….
• An interval estimate has got three components (concepts):
 – A statistic (the point estimator)
 – A margin of error (the measure of precision)
 – A confidence level (the measure of uncertainty)
• The interval estimate at a given confidence level is defined by the sample statistic ± margin of error.
• Interval estimate is preferred to point estimate as it considers the precision and uncertainty of estimation.
262
Interval Estimate Cont….
Margin of Error
• In a confidence interval, the range of values above and below the sample statistic is called the margin of error.
• It measures the precision of a sampling method.
• It is a function of the confidence level and another quantity called the standard error.
263
Interval Estimate Cont….
• Confidence Level
 – The probability part of the interval.
 – It describes how strongly we believe that a particular sampling method will produce an interval that includes the true population parameter.
 – 90, 95, and 99% confidence levels are commonly used.
 – For example, 95% CI means: if we used the same sampling method to select different samples and compute different interval estimates, the true population mean would fall within the range defined by the sample statistic ± margin of error 95% of the time.
264
Interval Estimate Cont….
• Example 6.1:
 – A local newspaper conducts an election survey and reports that the independent candidate will receive 30% of the vote. The newspaper states that the survey had a 5% margin of error and a confidence level of 95%.
 – Meaning: We are 95% confident that the independent candidate will receive between 25% and 35% of the vote.
265
CI for a single mean
• Background Concept: Sampling Distribution of Means.
 – One can generate the sampling distribution of means in the following manner:
 – Obtain a sample of n observations selected completely at random from a large population. Determine their mean and then replace the observations in the population.
 – Repeat the sampling procedure indefinitely.
 – The result is a series of means of samples of size n.
 – If each mean in the series is now treated as an individual observation and arranged in a frequency distribution, one comes up with the sampling distribution of means of samples of size n.
266
CI for a single mean cont..
• The sampling distribution of means has the following properties:
1. The mean of the sampling distribution of means is the same as the population mean.
2. The SD of the sampling distribution of means (which is called the standard error of the mean) is:
$\sigma_{\bar{x}} = \sigma / \sqrt{n}$
3. The sampling distribution of means is approximately a normal distribution, regardless of the original distribution, provided n is large (Central Limit Theorem).
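The three properties can be checked with a small simulation. The sketch below is illustrative only: the choice of a skewed exponential population (mean 1, SD 1), the sample size 40 and the 5000 repetitions are all assumptions made for the demonstration.

```python
import random
import statistics

rng = random.Random(0)
n = 40        # size of each sample
reps = 5000   # number of repeated samples (stands in for "indefinitely")

# Population: exponential with rate 1, so mean = 1 and SD = 1 (right-skewed)
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)   # property 1: close to 1
se_observed = statistics.stdev(sample_means)     # property 2: close to sigma/sqrt(n)
se_theory = 1.0 / n ** 0.5

print(round(mean_of_means, 3), round(se_observed, 3), round(se_theory, 3))
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is strongly skewed, which is property 3.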
267
CI for a single mean cont..
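• The general formula is
• CI = Sample statistic ± Z value × SE
$\Pr\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$
$\Pr\left[\bar{X} - 1.96(\sigma/\sqrt{n}) \le \mu \le \bar{X} + 1.96(\sigma/\sqrt{n})\right] = 0.95$
$95\%\ \text{CI for } \mu = \bar{X} \pm 1.96(\sigma/\sqrt{n})$
$\text{CI for } \mu = \bar{X} \pm Z_{\alpha/2}(\sigma/\sqrt{n})$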
268
CI for a single mean cont..
• However, when the population variance is unknown and the sample size is less than 30:
 – Sample variance should replace population variance.
 – Student t distribution should be used in the place of the standard normal distribution.
 – Hence the formula would be:
$\mu = \bar{X} \pm t_{\alpha/2,\,(n-1)}\,(s/\sqrt{n})$
269
CI for a single mean cont..
Example 6.2:
• The mean blood glucose level of 100 randomly selected healthy adults is 85 mg/dl. Find the 95% CI for the mean blood glucose level of all healthy adults (µ), given that the standard deviation for the population is 15 mg/dl.
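As a check on Example 6.2, the CI formula for a single mean with known σ can be evaluated as below. This is a minimal sketch; the helper name `mean_ci` is made up here, and the critical value comes from Python's standard library rather than a Z table.

```python
from statistics import NormalDist

def mean_ci(xbar, sigma, n, level=0.95):
    """CI for a population mean when the population SD is known."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    margin = z * sigma / n ** 0.5                   # Z value x SE
    return xbar - margin, xbar + margin

low, high = mean_ci(xbar=85, sigma=15, n=100)
print(round(low, 2), round(high, 2))   # 82.06 87.94
```

So the 95% CI for the mean blood glucose level is roughly 82.1 to 87.9 mg/dl.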
270
CI for difference between two means
• Background Concept: The Sampling Distribution of Difference of Means.
 – Consider two different populations X and Y.
 – The first population has mean of $\mu_x$ and standard deviation of $\sigma_x$.
 – The second population has mean of $\mu_y$ and standard deviation of $\sigma_y$.
 – From the first population take a sample of size $n_x$ and compute its mean $\bar{X}$.
 – From the second population take a sample of size $n_y$ and compute its mean $\bar{Y}$.
 – Then determine $\bar{X} - \bar{Y}$.
271
CI for difference between two means cont…
• Do the same for all pairs of samples that can be chosen independently from the two populations.
• The differences $\bar{X} - \bar{Y}$ are a new set of scores which form the sampling distribution of differences of means.
272
CI for difference between two means cont…
• Properties of the sampling distribution of differences of means.
1. The mean of the sampling distribution of differences of means equals the difference of the population means ($\mu_1 - \mu_2$).
2. The SD of the sampling distribution of differences of means (SE) is equal to:
$\sigma_{(\bar{X} - \bar{Y})} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
3. The distribution is approximately normally distributed.
273
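CI for difference between two means cont…
$\Pr\left(-1.96 < \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} < 1.96\right) = 0.95$
$\Pr\left[(\bar{X} - \bar{Y}) - 1.96\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar{X} - \bar{Y}) + 1.96\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right] = 0.95$
$95\%\ \text{CI of } \mu_1 - \mu_2 = (\bar{X} - \bar{Y}) \pm 1.96\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
$\mu_1 - \mu_2 = (\bar{X} - \bar{Y}) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$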
274
CI for difference between two means cont….
Example 6.3:
• A randomly selected 120 HIV patients who were on ART had lived for 25 years on average, with SD of 5 years, since their diagnosis was made. Similarly, a randomly selected 140 HIV patients who were not on ART had lived for 14 years on average, with SD of 4 years.
• Calculate the point estimate for the difference between the population means.
• Find the 95% CI for the difference between the means.
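Example 6.3 can be worked through numerically as follows. This is a sketch only: the function name `diff_means_ci` is made up, and the sample SDs are used in place of the unknown population SDs, which is an assumption justified by the large sample sizes.

```python
from math import sqrt
from statistics import NormalDist

def diff_means_ci(x1, s1, n1, x2, s2, n2, level=0.95):
    """Point estimate and CI for the difference between two means."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = x1 - x2                           # point estimate
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # SE of the difference
    return diff, diff - z * se, diff + z * se

diff, low, high = diff_means_ci(25, 5, 120, 14, 4, 140)
print(diff, round(low, 2), round(high, 2))   # 11 9.89 12.11
```

The point estimate is 11 years, and since the interval excludes 0 the two groups appear to differ in mean survival.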
275
CI for single proportion
• Background Concept: The Sampling Distribution of Proportions
• Here we are interested in the proportion of the population that has a certain characteristic, represented by P or π.
• If we take random samples of n observations indefinitely and calculate p for each sample, then we will have the sampling distribution of proportions.
• The sampling distribution of proportions has the following characteristics:
276
CI for single proportion cont…
• The sampling distribution of proportions has the following properties:
1. The mean of the sampling distribution of proportions = π.
2. The SD (SE) of the sampling distribution of proportions:
$\sigma_P = \sqrt{\frac{P(1-P)}{n}}$
3. The distribution is approximately normally distributed.
277
CI for single proportion cont..
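$\Pr\left(-1.96 < \frac{p - \pi}{\sqrt{\frac{P(1-P)}{n}}} < 1.96\right) = 0.95$
$\Pr\left[p - 1.96\sqrt{\frac{P(1-P)}{n}} \le \pi \le p + 1.96\sqrt{\frac{P(1-P)}{n}}\right] = 0.95$
$95\%\ \text{CI for } \pi = p \pm 1.96\sqrt{\frac{P(1-P)}{n}}$
$\pi = p \pm Z_{\alpha/2}\sqrt{\frac{P(1-P)}{n}}$
(In practice, the sample proportion p is used in place of P in the standard error.)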
278
CI for single proportion cont..
Example 6.4:
• In Addis Ababa, blood tests of 120 randomly selected commercial sex workers revealed that 30 of them are HIV positive. What will be the 99% confidence interval of HIV prevalence for all commercial sex workers in the city?
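Example 6.4 can be checked with the short sketch below. The helper name `proportion_ci` is made up for illustration, and the sample proportion is substituted for P in the standard error.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(events, n, level=0.95):
    """Point estimate and CI for a single population proportion."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 2.576 for 99%
    p = events / n
    se = sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

p, low, high = proportion_ci(30, 120, level=0.99)
print(p, round(low, 3), round(high, 3))   # 0.25 0.148 0.352
```

That is, the 99% CI for the prevalence runs from about 14.8% to 35.2%.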
279
CI for difference between two proportions
• Consider two different populations X and Y.
• The first population has proportion of $\pi_x$ and the second population has proportion of $\pi_y$.
• From the first population take a sample of size $n_x$ and compute its sample proportion $p_x$. From the second population take a sample of size $n_y$ and compute its sample proportion $p_y$.
• Then determine $p_x - p_y$.
• Do this for all pairs of samples that can be chosen independently from the two populations.
• The differences $p_x - p_y$ are a new set of scores which form the sampling distribution of differences of proportions.
280
CI for difference between two proportions cont…
• The sampling distribution of differences of proportions has the following properties:
1. The mean of the sampling distribution of differences of proportions equals the difference of the population proportions ($\pi_1 - \pi_2$).
2. The SD (SE) is given as:
$\sigma_{(p_1 - p_2)} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
3. The distribution is approximately normally distributed.
281
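CI for difference between two proportions cont…
$\Pr\left(-1.96 < \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}} < 1.96\right) = 0.95$
$\Pr\left[(p_1 - p_2) - 1.96\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \le \pi_1 - \pi_2 \le (p_1 - p_2) + 1.96\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\right] = 0.95$
$95\%\ \text{CI for } \pi_1 - \pi_2 = (p_1 - p_2) \pm 1.96\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
$\pi_1 - \pi_2 = (p_1 - p_2) \pm Z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$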
282
CI for difference between two proportions cont…
• Example 6.5:
• Among 200 randomly selected illiterate married women, 50 use contraceptives. Similarly, among 300 randomly selected married women who can read and write, 150 use contraceptives.
• Calculate the point estimate for the difference between the population proportions.
• Find the 95% CI for the difference between the proportions.
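A numerical check of Example 6.5, as a sketch only (the function name `diff_proportions_ci` is made up; the illiterate group is taken as group 1, so a negative difference means lower contraceptive use in that group):

```python
from math import sqrt
from statistics import NormalDist

def diff_proportions_ci(x1, n1, x2, n2, level=0.95):
    """Point estimate and CI for the difference between two proportions."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

diff, low, high = diff_proportions_ci(50, 200, 150, 300)
print(round(diff, 3), round(low, 3), round(high, 3))   # -0.25 -0.332 -0.168
```

The point estimate is a 25 percentage point difference, with a 95% CI of roughly 17 to 33 percentage points.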
283
CI for OR and RR
• When the intention of measurement of association is to make inference about a population parameter, the CI for OR or RR can be calculated using the following formulas.
• Why do we need natural logarithm here?
$\text{CI for OR} = \exp\left[\ln(OR) \pm Z_{\alpha/2}\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}\right]$
$\text{CI for RR} = \exp\left[\ln(RR) \pm Z_{\alpha/2}\sqrt{\frac{1 - a/(a+b)}{a} + \frac{1 - c/(c+d)}{c}}\right]$
284
CI for OR and RR Cont..
• SPSS can compute OR and RR with their confidence intervals, provided the information is entered in the following manner.
• Create 3 variables in the variable view page:
 – Frequency (for the four cells),
 – Exposure (0 as Yes, 1 as No) and
 – Outcome (0 as Yes, 1 as No)
• Enter the values into the data view page as mentioned above.
• Weight cases based on the “frequency” variable.
• Do the analysis in the following manner:
 – Descriptive statistics > Cross tabs > Put “exposure” as row and “outcome” as column > Statistics > Check “risk” > Continue > Ok
 – OR is given as “Odds ratio for exposure (yes/no)”
 – RR is given as “For cohort disease = yes”
285
Unbiased and Biased Estimators
• A statistic is called an unbiased estimator of a population parameter if the mean of the sampling distribution of the statistic is equal to the value of the parameter.
• The sample mean is an unbiased estimator of the population mean, since the mean of the sampling distribution of means equals the population mean.
• If the mean value of an estimator is either less than or greater than the true value of the quantity it estimates, then the estimator is called biased.
• A case of biased estimation occurs when the sample variance is used to estimate the population variance using the following formula:
$S^2 = \frac{\sum (x_i - \bar{x})^2}{n}$
286
Unbiased and Biased Estimators Cont…
• The sample variance calculated using this formula is, on average, less than the true population variance.
• This is because sample observations are closer to their own sample mean than to the population mean.
• To compensate for this, n-1 is used as the denominator.
• It is important to note that, even with n-1 as the denominator, the sample standard deviation still remains a biased estimator of the population standard deviation, but for large sample sizes this bias is negligible.
Estimation of Sample Size for Cross Sectional Studies
Why we need to calculate sample size:
• Representativeness Vs Cost• Estimation can be made based on a given confidence
level and standard error.
288
Sample Size to Estimate a Single Population Proportion
• If the main objective of the study is to estimate a single population proportion, then the sample size can be determined using the formula:
$n = \frac{Z_{\alpha/2}^2\, P(1-P)}{d^2}$
Where:
 – n is the minimum sample size required for a very large population (≥ 100,000)
 – Z is the critical value for a given confidence level
 – P is the expected proportion of the event to be studied (to be estimated based on findings of previous studies)
 – d is the margin of error
289
Sample Size to Estimate a Single Population Proportion Cont…
NB:
• If P is not known it has to be taken as 0.5. (Why?)
• Depending on the nature of the study, a 10-15% contingency should be added.
• If the size of the population is less than 100,000, the sample size should be corrected using the formula:
$\text{Corrected sample size} = \frac{n \times N}{n + N}$
• Where:
 – n is the non-corrected sample size
 – N is the size of the source population
290
Sample Size to Estimate a Single Population Proportion Cont…
Example 6.6:
• A researcher is interested in determining the prevalence of family planning use in Addis Ababa city. A previous study indicates the prevalence is around 55%. If the researcher wants to determine the sample size with a 95% confidence level and a 5% margin of error, how many women of reproductive age should be included in the study?
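Example 6.6 can be computed as sketched below. The function name `n_for_proportion` is made up, and the source population of 5,000 used in the second call is a hypothetical figure added only to show the correction formula for smaller populations.

```python
import math
from statistics import NormalDist

def n_for_proportion(p, d, level=0.95, N=None):
    """Minimum sample size to estimate a single proportion.

    p: expected proportion, d: margin of error,
    N: size of the source population (optional, for the correction).
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    n = z ** 2 * p * (1 - p) / d ** 2
    if N is not None:
        n = n * N / (n + N)       # corrected sample size
    return math.ceil(n)           # always round up

n_large = n_for_proportion(0.55, 0.05)
n_small = n_for_proportion(0.55, 0.05, N=5000)   # hypothetical N
print(n_large, n_small)   # 381 354
```

A 10-15% contingency, if required, would be added on top of these figures.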
291
Sample Size to Estimate Single Population Mean
• If the main objective of the study is to estimate a single population mean, then the sample size can be determined using the formula:
$n = \left(\frac{Z_{\alpha/2}\,\sigma}{d}\right)^2$
• Where:
 – n is the minimum sample size required for a large population
 – Z is the critical value for a given confidence level
 – σ is the expected SD of the variable to be studied
 – d is the margin of error
292
Sample Size to Estimate Single Population Mean
Example 6.7:
• A researcher is interested in determining the mean blood glucose level among high school students. A previous study indicates the mean is 85 mg/dl with a standard deviation of 15 mg/dl. If the researcher wants to determine the sample size with a 95% confidence level and tolerates a 2 mg/dl margin of error, how many students should be included in the study?
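A short sketch of the calculation for Example 6.7 (the function name `n_for_mean` is made up for illustration):

```python
import math
from statistics import NormalDist

def n_for_mean(sigma, d, level=0.95):
    """Minimum sample size to estimate a single population mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return math.ceil((z * sigma / d) ** 2)   # round up to a whole subject

n_students = n_for_mean(sigma=15, d=2)
print(n_students)   # 217
```

Note that the previous study's mean (85 mg/dl) does not enter the formula; only the SD and the margin of error do.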
293
Hypothesis Testing
What is a Hypothesis
• A statistical hypothesis is an assumption or a statement, which may or may not be true, concerning one or more populations.
• Setting up and testing hypotheses is an essential part of statistical inference.
• Examples of statistical hypotheses:
 – The mean pulse rate among AAU-HI students is 72/min.
 – The prevalence of HIV in AA is 12%.
 – The mean blood glucose level among Chinese and Indians is the same.
 – The prevalence of Hypertension in US and UK is the same.
 – The mean blood cholesterol level is the same before and after taking a drug.
295
Steps in Hypothesis Testing
Hypothesis testing involves the following steps:
1. Choose the hypothesis to be tested.
2. Choose an alternative hypothesis which would be accepted if the first hypothesis is rejected.
3. Decide on the appropriate test statistic for the hypothesis (Z, t, χ²).
4. Decide the level of significance and corresponding critical value.
5. Obtain the value of the test statistic.
6. Make a decision and interpret it.
296
The Null and Alternative Hypothesis
• In hypothesis testing two hypotheses are involved: the Null Hypothesis and the Alternative Hypothesis.
• Every hypothesis test requires the analyst to state a null hypothesis and an alternative hypothesis.
• They are mutually exclusive and complementary events.
• Both hypotheses are about the parameter, not about the statistic.
• The null hypothesis (H0 or HN):
 – The first hypothesis to be set by the researcher.
 – It commonly implies “equals to”, “no effect”, “no difference” or “no association” conclusions.
297
The Null and Alternative Hypothesis Cont..
Example:
• The mean pulse rate among AAU-HI students is 72/min.
• Drug A has no effect on the blood glucose level of diabetic patients.
• There is no difference in the prevalence of malaria in region A and region B.
• There is no association between smoking and lung cancer.
298
The Null and Alternative Hypothesis Cont..
• The alternative hypothesis (HA or H1):
• The hypothesis that will be accepted if H0 is rejected.
• Implies conclusions like “is not equal”, “has effect”, “there is difference” and “there is association”.
Example:
• The mean pulse rate among AAU-HI students is not equal to 72/min.
• Drug A has an effect on the blood glucose level of diabetic patients.
• There is a difference in the prevalence of malaria in region A and B.
• There is an association between smoking and lung cancer.
299
Test Statistic
• In hypothesis testing we accept or reject the null hypothesis by calculating the probability of getting a sample estimate at least as extreme as the one observed, given that the hypothesized population value is true.
• If the probability is very low we reject the null hypothesis.
• The probability is calculated using a test statistic.
• The most commonly used test statistics are the Z, Student’s t and χ² tests.
• The general formula to calculate a test statistic is:
$\text{test statistic} = \frac{(\text{estimate}) - (\text{hypothesized value})}{SE}$
300
Test Statistic Cont…
Student’s t Distribution:
• The use of the Z test requires knowledge of the variance of the population from which the sample is taken.
• It is somewhat strange that one can have knowledge of the population variance and not know the value of the population mean.
• As long as the sample size is large enough, the sampling distribution of most statistics can be approximated by the normal distribution.
• But when the sample size is small and the population SD is not known, statisticians rely on the distribution of the t statistic.
301
Test Statistic Cont…
• Student’s t distribution was developed by William Gosset (1876-1937) under the pseudonym “Student”.
• There are many different t distributions (the t distribution is a family of distributions).
• The particular form of the t distribution is determined by its Degrees of Freedom (df).
• The degrees of freedom (df) refers to the number of independent observations in a dataset after some restriction is made.
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
302
Test Statistic Cont…
• The t distribution has the following properties:
 – The mean of the distribution is equal to 0.
 – Symmetrical about the mean.
 – The variance is equal to v / (v - 2), where v is the df (v > 2). In general the variance is greater than 1, but approaches 1 as the sample size becomes large.
 – Extends from − infinity to + infinity.
 – Compared to the normal distribution, the t distribution is less peaked in the center and has higher tails.
 – The t distribution approaches the normal distribution as n-1 approaches infinity.
303
Test Statistic Cont…
304
Test Statistic Cont…
• For the t distribution to apply strictly we need the following two assumptions:
1. The observations are selected at random from the population.
2. The population distribution is normal.
• The second assumption need not be met exactly, as the t test is robust to departures from the normal distribution.
• That means even when assumption 2 is not satisfied, the probabilities calculated from the t table are still approximately correct.
305
Test Statistic Cont…
Chi Square Distribution (χ²):
• Mainly developed by Karl Pearson (1857-1936).
• A type of probability distribution like Z or t.
• Represented by the Greek letter Chi (χ).
• It is the distribution of the sum of the squared values of the observations drawn from the N(0,1) distribution.
• Let {X1, X2, ..., Xn} be n independent random variables, all ~ N(0,1).
• Then $\chi^2_n$ is defined as the distribution of the sum X1² + X2² + ... + Xn².
306
Test Statistic Cont…
• Mainly used to check association between two categorical variables.
• It is the most frequently used statistical technique for analysis of count or frequency data.
• It is not a single distribution but rather a family of distributions, indexed by the df.
• The mathematical formula of the χ² distribution is given as (where x > 0 and k is the df):
$Y = \frac{1}{\left(\frac{k}{2} - 1\right)!}\left(\frac{1}{2}\right)^{k/2} x^{(k/2) - 1} e^{-(x/2)}$
307
Test Statistic Cont…
• The graph is given as:
308
Test Statistic Cont…
• The formula for the test statistic which approximates the χ² distribution is (where O is the observed frequency and E is the expected frequency):
$\chi^2 = \sum \frac{(O - E)^2}{E}$
• It has the following characteristics:
 – Extends indefinitely to the right from 0.
 – Has only one tail.
 – As the df increases, the chi-square curve approaches a normal distribution.
309 310
Errors in Hypothesis Testing
• In testing hypothesis, two types of errors can be committed: Type I and Type II errors.
• The probability of committing a type I error is denoted as α. It is also called the level of significance (1 − confidence level).
• The probability of committing a type II error is denoted as β (1 − power of the study).

                     Decision of the hypothesis testing
Null hypothesis      Accept H0         Reject H0
H0 true              Correct           Type I error
H0 false             Type II error     Correct
311
One and Two Tailed Hypothesis
• Some hypotheses test whether one value is different from another or not, without additionally predicting which will be higher: non-directional or two-tailed test.
• At times some hypotheses test not only the difference of one value from the other but also the direction of the difference, i.e. whether it would be lower or higher: directional or one-tailed test.
312
Level of Significance, Critical Values and Critical Area
• In practice, the level of significance (α) is chosen arbitrarily.
• Three levels are commonly used: 0.01, 0.05 or 0.10 (depending on the confidence level).
• The smaller the level of significance, the more stringent the hypothesis test.
• The level of significance determines the values of the test statistic that would cause us to reject the hypothesis.
• The corresponding test statistic values for the level of significance are called the Critical Values.
• In a probability distribution, the area which is left to the extreme right and/or left of the critical value is called the Critical area (Rejection area).
• The area between the two critical values is called the Acceptance Area.
313
314
Level of Significance, Critical Values and Critical Area
• A level of significance has different critical values for one and two tailed tests.
• A level of significance of 0.05 has critical values of ±1.96 if the test is two tailed.
• However, if the test is one tailed the critical value would be 1.64 in the upper tail or −1.64 in the lower tail.
• Note that critical values for a given level of significance differ depending on the test statistic intended to be used.
315
Level of Significance, Critical Values and Critical Area
316
317 318
Level of Significance, Critical Values and Critical Area
α (level of significance)   Two tailed test   One tailed test, <   One tailed test, >
0.10                        ±1.64             -1.28                1.28
0.05                        ±1.96             -1.64                1.64
0.01                        ±2.58             -2.33                2.33
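The Z critical values in the table can be reproduced from the standard normal distribution. This is a small sketch using Python's standard library in place of a Z table; the variable names are made up for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, SD 1
critical = {}
for alpha in (0.10, 0.05, 0.01):
    two_tailed = std_normal.inv_cdf(1 - alpha / 2)   # alpha split over both tails
    one_tailed = std_normal.inv_cdf(1 - alpha)       # all of alpha in one tail
    critical[alpha] = (two_tailed, one_tailed)
    print(f"alpha={alpha:.2f}  two tailed=±{two_tailed:.3f}  one tailed={one_tailed:.3f}")
```

To three decimals the values are 1.645 / 1.282, 1.960 / 1.645 and 2.576 / 2.326, which the table shows to two decimals.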
319
Interpretation and Conclusion
• Interpretation is made based on comparisons between:
 – Test statistic calculated Vs critical value.
 – P value Vs significance level.
• Conclusion (i.e. accepting or rejecting the null hypothesis) should be made at the given level of confidence.
320
Test of Hypothesis about Single Population Mean
• Shows how to test the null hypothesis that the population mean is equal to some hypothesized value.
• One begins with a statement that claims a particular value for the unknown population mean.
• The hypothesis testing for single population mean either accepts or rejects this statement.
• The Z test or the t test is used:
 – Sample > 30: Z test
 – Sample < 30 and population SD known: Z test
 – Sample < 30 and population SD unknown: t test
321
Test of Hypothesis about Single Population Mean Cont..
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$
322
Test of Hypothesis about Single Population Mean Cont..
Example 7.1:
• Researchers are interested in the mean level of an enzyme in a certain population. They take a sample of 36 individuals, determine the level of enzyme in each and compute a sample mean of 22. It is known that the variable of interest is approximately normally distributed with a standard deviation of 10. Let’s say that they are asking the following question: Can we conclude that the mean enzyme level in this population is different from 25?
323
Test of Hypothesis about Single Population Mean Cont..
• Step 1 and 2: Define the H0 and H1:
$H_0: \mu = 25$
$H_1: \mu \neq 25$
• Step 3: Decide appropriate test statistic:
 – Z test
• Step 4: Decide the level of significance and critical value:
 – α value of 0.05.
 – ±1.96 is the critical value.
• Step 5: Obtain the value of the test statistic:
324
Test of Hypothesis about Single Population Mean Cont..
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{22 - 25}{10/\sqrt{36}} = \frac{-3}{1.67} = -1.80$
325
Test of Hypothesis about Single Population Mean Cont..
• Step 6: Make a decision and interpret it.
 – Accept the H0 at 95% confidence level:
 – −1.80 is within the acceptance region.
 – The one-tailed P value of 0.036 is > the α/2 value of 0.025.
326
Test of Hypothesis about Single Population Mean Cont..
Example 7.2:
• The researchers mentioned in example 7.1, instead of asking if they could conclude that µ ≠ 25, asked: Can we conclude that the mean enzyme level in this population is less than 25?
Solution:
• Step 1 and 2: Define the H0 and H1:
$H_0: \mu \ge 25$
$H_1: \mu < 25$
327
Test of Hypothesis about Single Population Mean Cont..
• Step 3: Decide appropriate test statistic:
 – Z test
• Step 4: Decide the level of significance and critical value:
 – α value of 0.05.
 – −1.645 is the critical value (one tailed test, lower tail).
• Step 5: Obtain the value of the test statistic:
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{22 - 25}{10/\sqrt{36}} = \frac{-3}{1.67} = -1.80$
328
Test of Hypothesis about Single Population Mean Cont..
• Step 6: Make a decision and interpret it.
 – Reject the H0 ($\mu \ge 25$) at 95% confidence level.
 – Test statistic −1.80 is within the rejection region (−1.80 < −1.645).
 – P value of 0.036 is less than the α value of 0.05.
329
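The arithmetic of Examples 7.1 and 7.2 can be checked with a short script. This is only a sketch using the Python standard library; the function names `normal_cdf` and `one_sample_z` are illustrative and are not part of the lecture material.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def one_sample_z(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for a Z test of one mean with known sigma."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    if alternative == "two-sided":
        p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    elif alternative == "less":
        p_value = normal_cdf(z)
    elif alternative == "greater":
        p_value = 1.0 - normal_cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'less' or 'greater'")
    return z, p_value

# Example 7.1 (H1: mu != 25) and Example 7.2 (H1: mu < 25)
z, p_two_sided = one_sample_z(22, 25, 10, 36)
_, p_less = one_sample_z(22, 25, 10, 36, alternative="less")
print(f"Z = {z:.2f}")
print(f"two-sided P = {p_two_sided:.3f}, one-sided P = {p_less:.3f}")
```

The same statistic leads to different decisions because the two-tailed test compares |Z| with 1.96 while the one-tailed test compares Z with -1.645.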
Example 7.3:
• Serum amylase level determination was made on a sample of 15 apparently healthy subjects. The sample yielded a mean of 96 units/100 ml and a standard deviation of 35 units/100 ml. The variance of the population was unknown. We want to know whether we can conclude that the mean of the population is different from 120 units/100 ml.

Test of Hypothesis about Single Population Mean Cont..
• Step 1 and 2: Define the H0 and H1.
$$H_0: \mu = 120 \qquad H_1: \mu \neq 120$$
• Step 3: Decide appropriate test statistic.
 – t test
• Step 4: Decide level of significance and critical value.
 – α value of 0.05.
 – t value for α/2 of 0.025 at df of 14: ±2.145
• Step 5: Obtain the value of the test statistic.
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{96 - 120}{35/\sqrt{15}} = -2.65$$
Test of Hypothesis about Single Population Mean Cont..
• Step 6: Make a decision and interpret it.
• We reject the null hypothesis because:
 – The calculated test statistic -2.65 is in the rejection area.
 – The corresponding one-tailed P value for -2.65 (between 0.01 and 0.005) is less than the α/2 value of 0.025.
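A minimal sketch of the calculation in Example 7.3 follows. The standard library has no t distribution, so the critical value 2.145 is simply taken from the t table as in the slide; the function name `one_sample_t` is illustrative.

```python
import math

def one_sample_t(xbar, mu0, s, n):
    """Return (t, df) for a one-sample t test with unknown population variance."""
    t = (xbar - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Example 7.3: n = 15, sample mean 96, sample SD 35, hypothesized mean 120
t, df = one_sample_t(96, 120, 35, 15)
critical = 2.145  # table value for alpha/2 = 0.025 at df = 14
reject_h0 = abs(t) > critical
print(f"t = {t:.2f}, df = {df}, reject H0: {reject_h0}")
```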
Testing of Hypothesis about Two Population Means
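• Compares the difference between two population means.
• H0: there is no difference between the two means.
• H1: there is a difference between the two means.
• Z or t test can be employed.
• Sum up the sample sizes of the two groups; if the total is greater than 30 use Z test, if it is less than 30 use t test.
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$

Testing of Hypothesis about Two Population Means Cont..
• t test is carried out with df of n1+n2-2, using the pooled variance $S_p^2$:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_p^2}{n_1} + \frac{S_p^2}{n_2}}}$$
$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$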
Testing of Hypothesis about Two Population Means Cont..
Example 7.4:
• A researcher wants to check whether the systolic blood pressure (SBP) among males is different from that among females. Among 50 male samples the mean SBP was 100 mmHg with standard deviation of 5 mmHg. Among 60 females, the mean SBP was 104 mmHg with standard deviation of 10 mmHg. Is there a significant difference between the two means?

Testing of Hypothesis about Two Population Means Cont..
• Step 1 and 2: Define the H0 and H1:
$$H_0: \mu_m = \mu_f \qquad H_1: \mu_m \neq \mu_f$$
• Step 3: Decide appropriate test statistic:
 – Z test
• Step 4: Decide the level of significance and critical value:
 – α value of 0.05.
 – ±1.96 is the critical value.
• Step 5: Obtain the value of the test statistic:
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{100 - 104}{\sqrt{\frac{5^2}{50} + \frac{10^2}{60}}} = \frac{-4}{\sqrt{0.5 + 1.67}} = \frac{-4}{1.47} = -2.72$$
• Step 6: Make a decision and interpret it.
 – We reject the H0 and accept the H1 (µm ≠ µf) at 95% confidence level because:
 – The calculated test statistic -2.72 is in the rejection region.
 – The corresponding one-tailed P value for -2.72 (0.0033) is less than the α/2 value of 0.025.
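As a check on Example 7.4, the two-sample Z statistic and its tail area can be computed as below. This is a sketch only; `two_sample_z` is an illustrative name, and the sample standard deviations are used in place of the population values, as the slide does.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Return (z, one_tail_p) for the difference between two independent means."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / se
    # Area in one tail beyond |z| under the standard normal curve
    one_tail_p = 0.5 * (1.0 + math.erf(-abs(z) / math.sqrt(2.0)))
    return z, one_tail_p

# Example 7.4: males (n = 50, mean 100, SD 5) vs females (n = 60, mean 104, SD 10)
z, one_tail_p = two_sample_z(100, 5, 50, 104, 10, 60)
print(f"Z = {z:.2f}, one-tailed P = {one_tail_p:.4f}")
```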
Example 7.5:
• Serum amylase determination was made on a sample of 15 apparently healthy subjects and 12 hospitalized subjects. Among healthy subjects, the mean was 96 units/100 ml with standard deviation of 35 units/100 ml. Among hospitalized patients, the mean was 120 units/100 ml with standard deviation of 40 units/100 ml. Is there a significant difference between the two mean values?

Testing of Hypothesis about Two Population Means Cont…
• Step 1 and 2: Define the H0 and H1
$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \neq \mu_2$$
• Step 3: Decide appropriate test statistic.
 – t test
• Step 4: Decide level of significance and critical value.
 – α value of 0.01.
 – t value for α/2 of 0.005 at df of 25: ±2.787
• Step 5: Obtain the value of the test statistic
$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} = \frac{(14)(35)^2 + (11)(40)^2}{25} = \frac{17150 + 17600}{25} = 1390, \qquad S_p = 37.3$$
Testing of Hypothesis about Two Population Means Cont…
$$t = \frac{96 - 120}{\sqrt{\frac{37.3^2}{15} + \frac{37.3^2}{12}}} = \frac{-24}{14.4} = -1.67$$
• Step 6: Make a decision and interpret it.
• We accept the null hypothesis (at 99% confidence level) because:
 – The calculated test statistic -1.67 is in the acceptance region.
 – The corresponding one-tailed P value for -1.67 (which is between 0.1 and 0.05) is greater than the α/2 value of 0.005.
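The pooled-variance calculation of Example 7.5 can be sketched as follows. The function name `pooled_t` is illustrative, and the critical value 2.787 is the table value quoted in the slide. Carrying full precision gives t of about -1.66 rather than the rounded -1.67, which does not change the decision.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df, pooled_variance) for two independent samples."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var / n1 + pooled_var / n2)
    return (mean1 - mean2) / se, df, pooled_var

# Example 7.5: healthy (n = 15, mean 96, SD 35) vs hospitalized (n = 12, mean 120, SD 40)
t, df, pooled_var = pooled_t(96, 35, 15, 120, 40, 12)
critical = 2.787  # table value for alpha/2 = 0.005 at df = 25
print(f"pooled variance = {pooled_var:.0f}, t = {t:.2f}, df = {df}")
print("reject H0:", abs(t) > critical)
```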
Paired t test for difference between two means:
• Every observation in one sample has one matching observation in the second sample.
• Commonly used in evaluation of interventions like new treatment modalities.
• Hence pre and post intervention (treatment) results are compared.
• Usually t test is used since individuals involved in the trial are few.
• The null hypothesis: there is no significant difference between the two tests.

• Procedures of hypothesis testing are the same, except for the formula used to calculate the test statistic:
$$t = \frac{\bar{d}}{SD/\sqrt{n}}$$
 – $\bar{d}$ = mean of the differences between the two samples.
 – SD = the standard deviation of the differences between the two samples.
 – n = the number of paired cases.
• Note that the calculated test statistic is compared at degrees of freedom of n-1.
Testing of Hypothesis about Two Population Means Cont…
Example 7.6:
• A random sample of 10 young men was taken and the pulse rate (PR) was measured before and after taking a cup of coffee. The result is given as follows. Does coffee have any effect on the heart rate? (Perform the hypothesis testing at 95% confidence level.)

Testing of Hypothesis Cont…
Subject   PR before   PR after   Difference
1         68          74         +6
2         64          68         +4
3         52          60         +8
4         76          72         -4
5         78          76         -2
6         62          68         +6
7         66          72         +6
8         76          76          0
9         78          80         +2
10        60          64         +4
Mean      68          71         +3
Testing of Hypothesis about Two Population Means Cont…
• H0: Coffee intake has no effect on PR
• H1: Coffee intake has effect on PR
• Test statistic: t test (paired)
• Critical value ±2.262 (α/2 of 0.025 at df of 9)
• First calculate the SD then the test statistic:
$$SD = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}} = 3.92 \qquad t = \frac{3}{3.92/\sqrt{10}} = 2.4$$
• Reject the null hypothesis (at 95% confidence level)
• Coffee intake has effect on PR.
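The paired t calculation of Example 7.6 can be reproduced from the before and after pulse rates. This is a sketch using the `statistics` module; the variable names are illustrative and the critical value 2.262 is the table value used in the slide.

```python
import math
import statistics

# Pulse rates of the 10 subjects in Example 7.6
before = [68, 64, 52, 76, 78, 62, 66, 76, 78, 60]
after = [74, 68, 60, 72, 76, 68, 72, 76, 80, 64]

differences = [a - b for a, b in zip(after, before)]
mean_d = statistics.mean(differences)
sd_d = statistics.stdev(differences)  # sample SD, divisor n - 1
n = len(differences)

t = mean_d / (sd_d / math.sqrt(n))
critical = 2.262  # table value for alpha/2 = 0.025 at df = n - 1 = 9
print(f"mean difference = {mean_d}, SD = {sd_d:.2f}, t = {t:.2f}")
print("reject H0:", abs(t) > critical)
```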
Test of Hypothesis About Single Population Proportion
• The null hypothesis is that the population proportion is equal to some hypothesized value.
• One begins with a statement that claims a particular value for the unknown population proportion.
• The hypothesis testing for single population proportion either accepts or rejects this statement.
• Here Z test statistic is used. The formula is given as:
$$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}}$$
Test of Hypothesis on Means Using SPSS
• In SPSS the One-Sample T Test, Independent-Samples T Test and Paired-Samples T Test are available under:
• Analyze > Compare Means > One-Sample T Test, Independent-Samples T Test or Paired-Samples T Test
Test of Hypothesis About Single Population Proportion
Example 7.7:
• A survey was conducted to determine the prevalence of protein energy malnutrition (PEM) in a rural kebele. Of 300 under-five children assessed, 123 were stunted. Can we conclude that the prevalence of PEM in the population is 50%?
• Step 1 and 2: Define the H0 and H1
$$H_0: \pi = 0.5 \qquad H_1: \pi \neq 0.5$$
• Step 3: Appropriate test statistic:
 – Z statistic
• Step 4: Decide the level of significance and the corresponding critical value:
 – Let's take α value of 0.1. Hence ±1.645 is the critical value.
• Step 5: Obtain the value of the test statistic (the sample proportion is p = 123/300 = 0.41):
$$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}} = \frac{0.41 - 0.5}{\sqrt{\frac{0.5(0.5)}{300}}} = \frac{-0.09}{\sqrt{\frac{0.25}{300}}} = -3.11$$
• Step 6: Make a decision and interpret it.
• At 90% confidence level we reject the null hypothesis that π = 0.5 because:
 – The calculated test statistic -3.11 is in the rejection region.
 – The corresponding one-tailed P value for -3.11 (i.e. 0.0009) is less than the α/2 value of 0.05.
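Example 7.7 can be verified with a few lines of Python. This is a sketch; `one_proportion_z` is an illustrative name, and the tail area is obtained from the standard normal distribution via `math.erf`.

```python
import math

def one_proportion_z(successes, n, pi0):
    """Return (p, z, one_tail_p) for a Z test of a single proportion."""
    p = successes / n
    se = math.sqrt(pi0 * (1 - pi0) / n)  # standard error under H0
    z = (p - pi0) / se
    one_tail_p = 0.5 * (1.0 + math.erf(-abs(z) / math.sqrt(2.0)))
    return p, z, one_tail_p

# Example 7.7: 123 stunted children out of 300, hypothesized prevalence 50%
p, z, one_tail_p = one_proportion_z(123, 300, 0.5)
print(f"p = {p:.2f}, Z = {z:.2f}, one-tailed P = {one_tail_p:.4f}")
```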
Testing of Hypothesis About Two Population Proportions
• The null hypothesis is that a population proportion is equal to another population proportion.
• The hypothesis testing for two population proportions either accepts or rejects this statement.
• Here Z test statistic is used. The formula is given as:
$$Z = \frac{p_1 - p_2}{\sqrt{P(1 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
• Where P is the pooled proportion:
$$P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$
Testing of Hypothesis About Two Population Proportions
Example 7.8:
• The prevalence of malaria in two malaria endemic kebeles, X and Y, was compared. In kebele X, among 120 samples 15 were positive. In kebele Y, among 100 samples 20 were positive. Is there any significant difference between the prevalence of malaria in kebeles X and Y?
• Step 1 and 2: Define the H0 and H1:
$$H_0: P_1 = P_2 \qquad H_1: P_1 \neq P_2$$
• Step 3: Decide appropriate test statistic:
 – Z statistic
• Step 4: Decide α value and the critical value:
 – Let's take α value of 0.05. Hence ±1.96 is the critical value.
• Step 5: Obtain the value of the test statistic:
 – First calculate the proportions and the pooled proportion
 – P1 = 15/120 = 0.125, P2 = 20/100 = 0.2
$$P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{120(0.125) + 100(0.2)}{120 + 100} = \frac{15 + 20}{220} = 0.159$$
• Then we calculate the test statistic:
$$Z = \frac{0.125 - 0.2}{\sqrt{0.159(1 - 0.159)\left(\frac{1}{120} + \frac{1}{100}\right)}} = \frac{-0.075}{\sqrt{(0.1337)(0.0183)}} = -1.51$$
• Step 6: Make a decision and interpret it.
 At 95% confidence level we accept the H0 (P1 = P2) because:
 – The test statistic -1.51 is in the acceptance region.
 – The corresponding one-tailed P value of 0.0655 is greater than the α/2 value of 0.025.
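A sketch of the two-proportion Z test of Example 7.8 is given below, using the pooled proportion for the standard error. The function name `two_proportion_z` is illustrative.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Return (pooled_p, z, one_tail_p) for comparing two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled_p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled_p * (1 - pooled_p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    one_tail_p = 0.5 * (1.0 + math.erf(-abs(z) / math.sqrt(2.0)))
    return pooled_p, z, one_tail_p

# Example 7.8: kebele X 15/120 positive, kebele Y 20/100 positive
pooled_p, z, one_tail_p = two_proportion_z(15, 120, 20, 100)
print(f"pooled P = {pooled_p:.3f}, Z = {z:.2f}, one-tailed P = {one_tail_p:.4f}")
```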
Test of Hypothesis on Proportions Using SPSS
• There is no "point and click" option in SPSS to do such hypothesis testing on proportions.
• Syntax-based analysis can be done.
Test of Hypothesis about Categorical Data
• It is also possible to apply hypothesis testing on categorical data.
• The Chi-square (χ²) test statistic is commonly used.
• This test is usually applied to tabulated data.
• The table contains two variables called the row and column variables.
• The test measures the discrepancy between the K observed frequencies (O) and the corresponding K expected frequencies (e), i.e. for all cells of the tabulation.
• Expected frequencies are the frequencies which would occur if there were no association between the row and column variables.
• The H0 of the Chi-square test is that there is no association between the row and column variables.
• While the H1 is that there is an association between the row and column variables.
• The closer the observed frequencies are to the expected frequencies, the more likely the H0 is true.
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i}$$
$$e = \frac{\text{row total for the cell} \times \text{column total for the cell}}{\text{grand total}}$$

Test of Hypothesis about Categorical Data
• Assumptions of Chi-square test:
 – No cell of the table has expected frequency less than 1,
 – No more than 20% of the expected frequencies should be less than 5.
• The Chi-square test statistic should be compared with the chi-square distribution with df of (R-1)(C-1).
• Though the Chi-square distribution is one-tailed, the test is always two-tailed.
Test of Hypothesis about Categorical Data
Example 7.9:
• A researcher is interested in assessing the effect of literacy on family planning (FP) use. Accordingly he collected data and tabulated the findings in the following manner. Can we say there is an association between educational status and family planning use?

FP use    Educational Status
          Illiterate   Literate   Total
Yes       63           49         112
No        15           33         48
Total     78           82         160

Test of Hypothesis about Categorical Data
• Step 1 and 2: Define the H0 and H1:
 – H0: There is no association between literacy and family planning use.
 – H1: There is an association between literacy and family planning use.
• Step 3: Decide appropriate test statistic:
 – χ² test.
• Step 4: Decide α and the corresponding critical value:
 – Let's take α value of 0.01.
 – At df of 1 the critical value is 6.635.
 – Acceptance area is 0 to 6.635, rejection area is χ² > 6.635.
• Step 5: Obtain the value of the test statistic:
 – First the expected frequencies should be calculated:
  • Expected frequency for cell a: 78 x 112/160 = 54.6
  • Expected frequency for cell b: 82 x 112/160 = 57.4
  • Expected frequency for cell c: 78 x 48/160 = 23.4
  • Expected frequency for cell d: 82 x 48/160 = 24.6
 – Assumptions of the χ² test are fulfilled.
 – Then we calculate the Chi-square statistic:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i} = \frac{(63 - 54.6)^2}{54.6} + \frac{(49 - 57.4)^2}{57.4} + \frac{(15 - 23.4)^2}{23.4} + \frac{(33 - 24.6)^2}{24.6}$$
$$\chi^2 = 1.29 + 1.23 + 3.02 + 2.87 = 8.41$$
• Step 6: Make a decision and interpret it.
• At 99% confidence level we accept the H1 that the two variables are associated due to the following reasons:
 – The calculated test statistic 8.41 is in the rejection area.
 – The corresponding P value for 8.41 (between 0.005 and 0.002) is less than the α value of 0.01.
• But what is the direction of the association?
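The expected frequencies and the Chi-square statistic of Example 7.9 can be computed for any R x C table with a short function. This is a sketch; `chi_square` is an illustrative name and the critical value 6.635 is the table value used in the slide.

```python
def chi_square(table):
    """Return (statistic, df, expected) for an R x C table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count = row total x column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Example 7.9: rows = FP use (yes, no), columns = illiterate, literate
observed = [[63, 49], [15, 33]]
statistic, df, expected = chi_square(observed)
critical = 6.635  # table value for alpha = 0.01 at df = 1
print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {statistic > critical}")
```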
Test of Hypothesis about Categorical Data Using SPSS
• In order to do a chi-square test using SPSS, follow these steps:
• Analyze > Descriptive Statistics > Crosstabs > Put the two categorical variables as column and row > Statistics > Check "Chi-square" > OK.
• The chi-square test is given in a table as "Pearson Chi-Square".
Fisher's exact test
• Fisher's exact test is a statistical significance test used in the analysis of contingency tables when the sample size is small (when the assumptions of the chi-square test are not fulfilled).
• It is named after its inventor, R. A. Fisher.
• For hand calculations, the test is only feasible in the case of a 2 x 2 contingency table.
• Its application to higher order tables is controversial.
• H0: there is no association between the two variables
• H1: there is an association between the two variables
• The hypothesis is tested by comparing the probability of observing the given or more extreme tables, given the null hypothesis is true, with the level of significance.
Fisher's exact test
• The exact probability of observing a given table is given as:
$$P = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{N!\,a!\,b!\,c!\,d!}$$

a        b        (a+b)
c        d        (c+d)
(a+c)    (b+d)    N

Fisher's exact test
• Hypothesis testing using Fisher's exact test involves the following steps:
1. Calculate the probability of the observed table itself,
2. List all possible more extreme tables manually (given the marginal totals are maintained),
3. Calculate their respective exact probabilities,
4. Calculate the probability of getting the observed or more extreme tables,
5. Multiply the total by 2 (to get the two-tailed value),
6. Compare the value with the value of α.
Fisher's exact test
Example 7.10:
• In the following tabulated data, is there any association between the treatment type and the survival rate of patients? (Test the hypothesis at 95% confidence level.)

Treatment type   Survived   Died   Total
A                7          2      9
B                5          6      11
Total            12         8      20

Fisher's exact test
• H0: No association between the treatment modalities and survival rate.
• H1: There is an association between the treatment modalities and survival rate.
• Test statistic: Fisher's exact test, because two of the expected frequencies have values less than 5.
• Level of significance: 5%
• Calculate the probability of getting the given or more extreme tables.
Fisher's exact test
• Observed table:

Treatment type   Survived   Died   Total
A                7          2      9
B                5          6      11
Total            12         8      20

• Probability of observing this table = 9!11!12!8!/(20!7!2!5!6!) = 0.132

Fisher's exact test
• First possible extreme table:

Treatment type   Survived   Died   Total
A                8          1      9
B                4          7      11
Total            12         8      20

• Probability of observing this table = 9!11!12!8!/(20!8!1!4!7!) = 0.024

Fisher's exact test
• Second possible extreme table:

Treatment type   Survived   Died   Total
A                9          0      9
B                3          8      11
Total            12         8      20

• Probability of observing this table = 9!11!12!8!/(20!9!0!3!8!) = 0.001

Fisher's exact test
• Probability of getting the observed or more extreme tables:
 – 0.132 + 0.024 + 0.001 = 0.157 (one-tailed)
 – Two-tailed: 2 x 0.157 = 0.314
• Conclusion and interpretation:
 – Accept the null hypothesis at 95% confidence level (0.314 is greater than the α value of 0.05).
 – There is no association between the treatment modalities and survival rate.
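The table probabilities of Example 7.10 can be generated programmatically instead of listing the extreme tables by hand. The sketch below uses `math.comb` (Python 3.8 or later); the function names are illustrative, and the doubling of the one-tailed value follows the procedure given in these notes.

```python
from math import comb

def table_probability(a, b, c, d):
    """Exact probability of a 2 x 2 table with fixed margins.

    Equivalent to (a+b)!(c+d)!(a+c)!(b+d)! / (N! a! b! c! d!).
    """
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def one_tailed_fisher(a, b, c, d):
    """Sum the probabilities of the observed table and of every table that is
    more extreme in the direction of a larger 'a', keeping the margins fixed."""
    total = 0.0
    while b >= 0 and c >= 0:
        total += table_probability(a, b, c, d)
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return total

# Example 7.10: treatment A (7 survived, 2 died), treatment B (5 survived, 6 died)
p_observed = table_probability(7, 2, 5, 6)
p_one_tailed = one_tailed_fisher(7, 2, 5, 6)
p_two_tailed = 2 * p_one_tailed  # doubling rule used in these notes
print(f"observed = {p_observed:.3f}, one-tailed = {p_one_tailed:.3f}, two-tailed = {p_two_tailed:.3f}")
```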
Fisher's exact test using SPSS
• In order to do Fisher's exact test using SPSS, follow these steps:
• Analyze > Descriptive Statistics > Crosstabs > Put the two categorical variables as column and row > Statistics > Check "Chi-square" > OK.
• Fisher's exact test is given in a table titled "Chi-Square Tests".
• NB: SPSS doesn't do Fisher's exact test for higher order tables.

Summary
• The interpretation of the hypothesis test is dependent on the confidence level at which the test is conducted.
• A hypothesis which is accepted at a lower level of confidence cannot be rejected at a higher level of confidence.
• A hypothesis which is rejected at a lower level of confidence can be accepted at a higher level of confidence.
• A hypothesis which is rejected at a higher level of confidence cannot be accepted at a lower level of confidence.
• A hypothesis which is accepted at a higher level of confidence can be rejected at a lower level of confidence.
Sample Size Calculation for Comparative Studies
• The concepts discussed in this chapter can be applied to the calculation of sample size for comparative studies.
• For comparative studies like case-control, cohort and interventional studies, the optimal size for the two groups is calculated using the formula:
$$n_1 = \frac{\left[Z_{\alpha}\sqrt{\left(1 + \frac{1}{r}\right)P(1 - P)} + Z_{\beta}\sqrt{P_1(1 - P_1) + \frac{P_2(1 - P_2)}{r}}\right]^2}{(P_1 - P_2)^2}$$
• Where
$$P = \frac{P_1 + rP_2}{1 + r}$$

Sample Size Calculation Cont..
• Where:
 P is the pooled proportion
 P1 is the expected 1st proportion
 P2 is the expected 2nd proportion
 r is the number of controls per case
 Alpha is the probability of type I error
 Beta is the probability of type II error
 n1 is the sample size for the first group
 NB: n2 is calculated by multiplying n1 by r.
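A rough sketch of the sample size calculation is shown below. It assumes the formula n1 = [Zα√((1 + 1/r)P(1 − P)) + Zβ√(P1(1 − P1) + P2(1 − P2)/r)]² / (P1 − P2)² with P = (P1 + rP2)/(1 + r). The input proportions (0.30 and 0.50), the Z values (1.96 for a two-sided α of 0.05 and 0.84 for 80% power) and the function name are illustrative assumptions, not values taken from the lecture.

```python
import math

def sample_size_two_proportions(p1, p2, r=1.0, z_alpha=1.96, z_beta=0.84):
    """Return (n1, n2) for comparing two proportions with r controls per case."""
    p_bar = (p1 + r * p2) / (1 + r)  # pooled proportion
    term_alpha = z_alpha * math.sqrt((1 + 1 / r) * p_bar * (1 - p_bar))
    term_beta = z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2) / r)
    n1 = math.ceil((term_alpha + term_beta) ** 2 / (p1 - p2) ** 2)
    n2 = math.ceil(r * n1)  # second group is r times the first
    return n1, n2

# Hypothetical inputs: expected proportions of 30% and 50%, one control per case
n1, n2 = sample_size_two_proportions(0.30, 0.50)
print(f"n1 = {n1}, n2 = {n2}")
```

Sample sizes are rounded up because a fraction of a subject cannot be recruited.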
Correlation and Linear Regression

Regression and Correlation
• Many medical investigations are concerned with:
 – Establishment of a relationship between two variables.
 – The strength of a relationship.
 – Predicting one variable on the basis of another.
 – Controlling the effect of unwanted variables.
• Such intentions can be addressed by using either correlation or regression analysis.
Correlation Analysis
• Initially developed by Sir Francis Galton (1888) and Karl Pearson (1896).
• Correlation is the quantification of the degree to which two random quantitative variables are related, provided the relationship is linear.
• Both of the variables should be measured on the same set of study units.
• Strength of relationship measurement: Correlation Coefficient.
• Most commonly used coefficient: Product Moment Correlation or Pearson Correlation Coefficient (r).
• The symbol rho (ρ) is used to represent the population correlation coefficient.
• Unitless measure.

Correlation Analysis
• Does not imply cause and effect relationship.
• The value of r ranges from -1 to +1.
• If the correlation coefficient is greater than 0, the variables are said to be positively correlated (i.e. as X increases, Y tends to increase).
• If the correlation coefficient is less than 0, the variables are said to be negatively correlated (i.e. as X increases, Y tends to decrease).
• If the correlation coefficient is 0 then the variables are said to be uncorrelated.
Correlation Analysis Cont…
• The formula for computing the sample correlation coefficient (r) for two variables X and Y is given as:
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$$
• Or
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
• Before computing r, a scatter plot of the two variables should be drawn. Why?

Correlation Analysis Cont…
[Figure: scatter plots of y against x illustrating linear versus curvilinear relationships, strong versus weak relationships, and no relationship]
Correlation Analysis Cont…
• Assumptions of correlation analysis:
 – Independent random samples are taken
 – Both variables are on interval/ratio scale
 – Linear association between X and Y
 – Paired measures for X and Y
 – Normal distribution for X and Y
 – Homogeneity of variance (Homoscedasticity)
• In situations where its assumptions are violated, correlation becomes inadequate to explain a given relationship.

Correlation Analysis Cont…
Example 8.1:
• The data of a random sample of 20 countries are shown in the following table. X represents the percentage of children immunized by age one year and Y represents the under-five mortality rate. Determine the strength of association between the two variables.
Correlation Analysis Cont…

Country    % Immunized (X)   CMR/1000 LB (Y)   XY      Y²       X²
Bolivia    77                118               9086    13924    5929
Brazil     69                65                4485    4225     4761
Cambodia   32                184               5888    33856    1024
Canada     85                8                 680     64       7225
China      94                43                4042    1849     8836
Czech      99                12                1188    144      9801
Egypt      89                55                4895    3025     7921
Ethiopia   13                208               2704    43264    169
Finland    95                7                 665     49       9025
France     95                9                 855     81       9025
Greece     54                9                 486     81       2916
India      89                124               11036   15376    7921
Italy      95                10                950     100      9025
Japan      87                6                 522     36       7569
Mexico     91                33                3003    1089     8281
Poland     98                16                1568    256      9604
Russia     73                32                2336    1024     5329
Senegal    47                145               6815    21025    2209
Turkey     76                87                6612    7569     5776
UK         90                9                 810     81       8100
Total      1548              1180              68626   147118   130446
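Correlation Analysis Cont…
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
$$r = \frac{20(68626) - (1548)(1180)}{\sqrt{\left[20(130446) - (1548)^2\right]\left[20(147118) - (1180)^2\right]}} = -0.79$$
• There is a strong negative linear relationship between the two variables.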
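The correlation coefficient of Example 8.1 can be recomputed from the raw data with the computational formula. This is a sketch; `pearson_r` is an illustrative name and the two lists are the X and Y columns of the table in the order shown.

```python
import math

def pearson_r(x, y):
    """Pearson correlation via the computational (sums) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator

# Example 8.1: % immunized (X) and under-five mortality per 1000 live births (Y)
immunized = [77, 69, 32, 85, 94, 99, 89, 13, 95, 95,
             54, 89, 95, 87, 91, 98, 73, 47, 76, 90]
mortality = [118, 65, 184, 8, 43, 12, 55, 208, 7, 9,
             9, 124, 10, 6, 33, 16, 32, 145, 87, 9]

r = pearson_r(immunized, mortality)
print(f"r = {r:.2f}, r squared = {r * r:.2f}")
```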
Correlation Analysis Cont…
• Interpretation option:
 – 100 x r² (%): shows the proportion of variation of a variable explained by the other.
 – Rule of thumb:
[Table: rule of thumb for interpreting the size of r]

Correlation Analysis Cont…
• Hypothesis Testing for a Correlation Coefficient
• As with means and percentages, it is also possible to test the significance of a population correlation.
• For a two-tailed test
 – H0: ρ is 0
 – H1: ρ is different from 0
• The t test statistic is given as (with n-2 df):
$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$
Correlation Analysis Cont…
Example 8.2:
• At the 0.05 level of significance, can we claim that the correlation coefficient in Example 8.1 indicates a significant negative relationship between immunization coverage and child mortality?

Correlation Analysis Cont..
• The critical t value for the 0.05 level of significance (one-tailed) at 18 degrees of freedom is -1.734. Then we calculate the test statistic:
$$t = r\sqrt{\frac{n - 2}{1 - r^2}} = -0.79\sqrt{\frac{20 - 2}{1 - (-0.79)^2}} = -0.79\sqrt{\frac{18}{0.3759}} = -5.47$$
• Since -5.47 is less than -1.734, we accept the H1 that r indicates a significant negative relationship between immunization coverage and child mortality.
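The t statistic for the correlation coefficient in Example 8.2 follows directly from r and n. A minimal sketch, with `correlation_t` as an illustrative name and the critical value -1.734 taken from the t table as in the slide:

```python
import math

def correlation_t(r, n):
    """Return (t, df) for testing whether a correlation coefficient differs from 0."""
    df = n - 2
    t = r * math.sqrt(df / (1 - r**2))
    return t, df

# Example 8.2: r = -0.79 from n = 20 countries
t, df = correlation_t(-0.79, 20)
critical = -1.734  # one-tailed table value for alpha = 0.05 at df = 18
print(f"t = {t:.2f}, df = {df}, significant negative relationship: {t < critical}")
```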
Correlation Analysis Cont..
Limitations:
• Applied only to a linear relationship.
• One must not extrapolate an observed correlation beyond the observed ranges of the x and y values.
• Does not differentiate dependent and independent variables.
• Confounding by a third variable.

Correlation Analysis Cont..
Spearman's Rank Correlation
• It is a nonparametric (distribution-free) rank statistic proposed by Charles Spearman in 1904 as a measure of the strength of the association between two variables.
• Denoted as $r_s$.
• Is applied when:
 – The normality assumption is not satisfied or cannot be tested,
 – At least one of the variables is given in ordinal scale.
• In the calculation of the coefficient, actual values of both variables should be changed into ranks.
Correlation Analysis Cont..
• The formula for the Spearman Correlation Coefficient is (given that there is no tied rank):
$$r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$$
• Where:
 – 6 is a constant,
 – D is the difference between a subject's ranks on the two variables,
 – n is the number of subjects.
• Consider the following example.

Correlation Analysis Cont..
• The following table presents the MMR level and delivery service coverage in 10 developing countries.

Country   MMR (per 100,000 LB)   MMR rank   Delivery service coverage (%)   Rank   D    D²
1         315                    4          55                              6      -2   4
2         450                    6          40                              5      1    1
3         200                    1          70                              8      -7   49
4         250                    3          79                              10     -7   49
5         243                    2          75                              9      -7   49
6         830                    9          25                              3      6    36
7         850                    10         20                              2      8    64
8         656                    7          20                              1      6    36
9         701                    8          30                              4      4    16
10        410                    5          60                              7      -2   4
Total                                                                                   308

$$r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)} = 1 - \frac{6 \times 308}{10(100 - 1)} = 1 - \frac{1848}{990} = 1 - 1.87 = -0.87$$
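The Spearman coefficient for the MMR example can be computed from the two rank columns. The sketch below takes the ranks exactly as they appear in the table (including the ranks 2 and 1 assigned to the two countries with 20% coverage); `spearman_from_ranks` is an illustrative name.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Return (r_s, sum of squared rank differences) using the no-ties formula."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    r_s = 1 - 6 * sum_d2 / (n * (n**2 - 1))
    return r_s, sum_d2

# Ranks of the 10 countries as tabulated: MMR rank and delivery coverage rank
mmr_rank = [4, 6, 1, 3, 2, 9, 10, 7, 8, 5]
coverage_rank = [6, 5, 8, 10, 9, 3, 2, 1, 4, 7]

r_s, sum_d2 = spearman_from_ranks(mmr_rank, coverage_rank)
print(f"sum of D squared = {sum_d2}, r_s = {r_s:.2f}")
```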
Correlation Analysis Cont..
• Inference about $r_s$
• For hypothesis testing, a t score can be calculated (at df of n-2) using the formula:
$$t = \frac{r_s}{\sqrt{\frac{1 - r_s^2}{n - 2}}}$$
• For the previous example the t score would be:
$$t = \frac{-0.87}{\sqrt{\frac{1 - (-0.87)^2}{10 - 2}}} = -5$$
• If the hypothesis test is a two-tailed test at the 0.05 level of significance, we reject the H0 as |t| = 5 > 2.306.

Correlation Analysis Cont..
Partial Correlation
• A method used to describe the relationship between two variables while taking away the effects of another variable, or several other variables, on this relationship.
• Still requires meeting all the usual assumptions of Pearson correlation.
• But the covariate need not necessarily be numeric.
Correlation Analysis Using SPSS
• In order to do correlation analysis using SPSS, follow these steps:
• Analyze > Correlate > Bivariate > Put the two variables in the variable box > Select Pearson or Spearman (Kendall's tau-b is also available) > OK.
• Partial correlation can also be done.
• Analyze > Correlate > Partial.
• But before that, don't forget the scatter plot.
Regression Analysis
• In correlation analysis the interest is to show how two numeric variables are related.
• However, in regression analysis we are interested in explaining or modeling a dependent variable (Y) as a function of one or more independent variables (X).
• Regression analysis is used to:
 – Assess association between two variables.
 – Predict/explain the value of a dependent variable based on the value of at least one independent variable (i.e. mathematical modeling).
 – Control for confounding factors.
 – Show possible effect of interaction among variables.

Regression Analysis Cont..
• The general regression equation is given as:
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
 Where: Y is the value of the dependent variable,
 X is the independent variable,
 α is the intercept,
 β is the coefficient of the independent variable
• If the equation has only one independent variable the regression is called Simple Regression.
• If multiple independent variables are involved it is called Multiple Regression.
• In public health the most commonly used types of regression analysis are: Linear and Logistic Regression

Linear Regression
• Also known as linear least squares regression.
• It is by far the most widely used modeling method.
• The dependent variable is assumed to be a linear function of one or more independent variables plus an error introduced to account for all other factors.
$$Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$
• Where Y is the dependent variable, the Xs are the independent variables and ε is the random error term.
• The DV (Y) is given in continuous numeric scale while the IV/s (X) can be of any type (mostly numeric variables).
Linear Regression Cont..
• The equation provides what value the DV would have for a given value/s of the IV/s.
• For example, if we develop a linear model with the DV of body height and the IV of serum growth hormone (GH), we can predict height for a person with a given value of serum GH.
• Can be simple or multiple regression.
• It attempts to model the relationship between the dependent and independent variables by fitting a linear equation to observed data.

Linear Regression Cont..
• A scatter plot is helpful to assess the presence of a linear trend of association.
• Consider the following data showing the number of households in China with TV.

Linear Regression Cont..
• If we plot these data, we get the following graph.
[Figure: scatter plot of households with TV against year]

Linear Regression Cont..
• Although no straight line passes exactly through these points, there are many straight lines that pass close to them. Here is one of them.
[Figure: the same scatter plot with one candidate straight line drawn through the points]
Linear Regression Cont..
• How would you draw a line through the points? How do you determine which line 'fits best'?
• The most common method for fitting a regression line is the method of least squares.
• This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line.
• "Best fit" means the differences between the actual Y values and the predicted Y values are at a minimum.
• Hence, linear regression is a method of finding the linear equation that comes closest to fitting a collection of data points.

Linear Regression Cont..
[Figure: four data points scattered around the fitted line $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, with the vertical deviations $\hat{\varepsilon}_1$ to $\hat{\varepsilon}_4$ marked; for example $Y_2 = \hat{\beta}_0 + \hat{\beta}_1 X_2 + \hat{\varepsilon}_2$]
• LS minimizes
$$\sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2$$
Linear Regression Cont..
• Suppose that we used the line rather than the data points to estimate the number of households with TV.
• Then we would get slightly different values from the original observed values. These values are called predicted values.

Year (X)               Households with TV (millions)   Households with TV (millions)   Residual
(0 represents 2000)    Observed Values                 Predicted Values
0                      68                              62                              6
1                      72                              70                              2
2                      80                              78                              2
3                      83                              86                              -3

Linear Regression Cont..
• The better our choice of line, the closer the predicted values will be to the observed values.
• The difference between the observed value and the predicted value is called the residual.
• Residual = Observed Value - Predicted Value
• The best line is the line with the smallest sum of squares of error (SSE), i.e. least squares estimation.
• SSE = Sum of squares of residuals = Sum of (y observed – y predicted)²
Linear Regression Cont..
• The manual calculation of the coefficients of linear regression is possible when we have one independent variable, i.e. Y = α + βX
• As with correlation analysis, here we should have a set of paired DV and IV values for all study units.
• The line which represents the dataset (Y = α + βX) is calculated using the formulas:
$$\beta = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} \qquad \alpha = \bar{y} - \beta\bar{x}$$
Linear Regression Cont..
• Consider the following data.
• First we should plot a scatter diagram.
Linear Regression Cont..
[Figure: scatter diagram of the data, Y (0 to 4) against X (0 to 6)]
Linear Regression Cont…
$n = 5,\ \sum X_i = 15,\ \sum Y_i = 10,\ \sum X_i Y_i = 37,\ \sum X_i^2 = 55$
$\hat{\beta}_1 = \frac{\sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}} = \frac{37 - \frac{(15)(10)}{5}}{55 - \frac{(15)^2}{5}} = 0.70$
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X} = 2 - (0.70)(3) = -0.10$
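The manual calculation can be sketched in Python as follows. The original data table did not survive in these notes, so the x and y values below are hypothetical, chosen only because they reproduce the sums used in the calculation (ΣX = 15, ΣY = 10, ΣXY = 37, ΣX² = 55). The function name is also illustrative.

```python
def fit_simple_linear(x, y):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    # Slope: [sum(xy) - sum(x)sum(y)/n] / [sum(x^2) - (sum(x))^2/n]
    slope = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: mean(y) - slope * mean(x)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

# Hypothetical data (an assumption): consistent with the sums in the notes.
x_values = [1, 2, 3, 4, 5]
y_values = [1, 1, 2, 2, 4]

intercept, slope = fit_simple_linear(x_values, y_values)
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
```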
Linear Regression Cont…
• One of the indices used to measure model goodness of fit for simple linear regression is R-squared, the coefficient of determination.
• It is the proportion of variation explained by the best-line model.
• It depends on the ratio of the sum of squares of error from the regression model (SSE) to the sum of squared differences around the mean (SST = sum of squares total).
• Where: $R^2 = 1 - \frac{SSE}{SST}$
Linear Regression Cont…
• For multiple linear regression, adjusted R-squared is used.
• As a general rule of thumb, the R-squared or adjusted R-squared should be higher than 0.80 to produce a good linear model.
• If your R-squared is less than 0.5, it is recommended that you consider another type of model rather than a linear model.
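As a rough illustration of R-squared, the sketch below computes 1 - SSE/SST for the fitted line Ŷ = -0.10 + 0.70X from the worked example. The data values are hypothetical (they only match the sums used in that example), and the function name is an assumption.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_observed = sum(observed) / len(observed)
    # SSE: squared deviations of the observations from the fitted line.
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # SST: squared deviations of the observations from their mean.
    sst = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - sse / sst

# Hypothetical data (an assumption), consistent with the worked example.
x_values = [1, 2, 3, 4, 5]
y_values = [1, 1, 2, 2, 4]
predicted = [-0.10 + 0.70 * x for x in x_values]

r2 = r_squared(y_values, predicted)
print("R-squared:", round(r2, 3))
```

With these illustrative values the result falls just above the 0.80 rule of thumb mentioned above.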
Linear Regression Cont…
Interpretation of linear regression coefficient:
• Let’s consider the following simple linear regression equation:
• Y = α + βX
• β represents the slope, and α represents the y-intercept.
• The slope represents the estimated average change in Y when X increases by one unit.
• The intercept represents the estimated average value of Y when X equals zero. (Practically less important)
• When we represent a binary independent variable (coded as 0-1), the slope represents the estimated average change in Y when you switch from 0 to 1.
Linear Regression Cont…
Example 8.3:
• Assume that the duration of breast feeding in weeks (Y) was found to be positively correlated with maternal age in years (X). A linear regression model was developed to explain the association. The equation is given as Y = 5.92 + 0.389X. How would you explain the equation?
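One way to read the equation in Example 8.3 is to evaluate it at two ages one year apart; the difference is the slope. The sketch below does this, with an illustrative function name and hypothetical ages of 25 and 26 years.

```python
def predicted_breastfeeding_weeks(maternal_age_years):
    """Predicted duration of breast feeding (weeks) from Example 8.3."""
    return 5.92 + 0.389 * maternal_age_years

# Hypothetical ages, one year apart, chosen only for illustration.
weeks_at_25 = predicted_breastfeeding_weeks(25)
weeks_at_26 = predicted_breastfeeding_weeks(26)

# The difference equals the slope: the estimated average change in Y
# for a one-year increase in maternal age.
change_per_year = weeks_at_26 - weeks_at_25
print(round(weeks_at_25, 3), round(weeks_at_26, 3), round(change_per_year, 3))
```

That is, each additional year of maternal age is associated with an estimated average increase of 0.389 weeks of breast feeding; the intercept 5.92 is the predicted value at age zero, which has little practical meaning here.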
Linear Regression Cont…
Assumptions:
• Normal distribution: Regression assumes that variables have normal distributions.
• Homoscedasticity: The variance of the error terms is constant for each value of x.
• Linearity: The relationship between each x and y is linear.
• Normally distributed error terms: The error terms follow the normal distribution.
• Independence of error terms: Successive residuals are not correlated.
• No multicollinearity: The independent variables are not correlated with each other.
Linear Regression Cont…
Hypothesis testing in linear regression:
• Questions to be answered through the hypothesis testing are:
– Does the entire set of independent variables contribute significantly to the prediction of y?
– Does the addition of one particular variable of interest add significantly to the prediction of y achieved by the other independent variables already in the model?
• The null and alternative hypotheses are given as:
– H0: β1 = β2 = · · · = βp = 0
– H1: βj ≠ 0 for at least one j.
Linear Regression Cont…
• F test and t test are used to test the hypothesis.
• F is a test for statistical significance of the regression equation as a whole. It is obtained by dividing the explained variance by the unexplained variance. (Given as ANOVA table)
• The t test is used to see whether a specific variable is significant in explaining the dependent variable or not.
Linear Regression Using SPSS
• Analyze > Regression > Linear Regression > Put the dependent and independent variables > Select appropriate statistics > Ok.
Logistic Regression
Introduction
• Logistic Regression is a model used to predict the probability of occurrence of a categorical event by fitting data to a Logistic Curve.
• Common dichotomous dependent variables include disease status (healthy or ill), clinical outcome (alive or dead), treatment outcome (success or failure), utilization of health commodities (utilization or non-utilization), etc.
• Application:
– Modeling for risk prediction, identification of determinants and health programming,
– Controlling confounding and interacting factors.
Introduction Cont……
• Comparative advantages of Logistic Regression:
– Fewer assumptions,
– Mathematically amenable,
– Easier interpretation.
• Classification of Logistic Regression (LR):
– Binomial LR: Dependent variable is dichotomous.
– Multinomial LR: Dependent variable with more than two classes.
– Ordinal LR: Dependent variable with multiple and ranked classes.
Logistic Regression Function
• Binary dependent variables are coded as 0 or 1.
• The probability of the distribution is equal to the proportion of 1s in the distribution (P).
• The logistic function associates the Independent Variable (IV) X with the probability of occurrence of the Dependent Variable (DV) Y.
• The function is given as: $P = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$
LR Function Cont…
• The function is represented by an S-shaped “Sigmoid graph”, which is called the Logistic Curve.
• Examples: [Figure: example logistic curves]
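The S shape of the logistic curve can be seen by evaluating the logistic function P = 1 / (1 + e^(-z)), with z = α + βx, over a range of x values. In the sketch below the coefficients α = -4.0 and β = 0.8 are hypothetical, chosen only to make the curve visible over x = 0 to 10.

```python
import math

def logistic(z):
    """Logistic function: maps any real z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (assumptions for illustration only).
alpha, beta = -4.0, 0.8

# Probabilities rise slowly, then steeply, then level off: the sigmoid shape.
probabilities = [logistic(alpha + beta * x) for x in range(0, 11)]
for x, p in enumerate(probabilities):
    print(x, round(p, 3))
```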
LR Function Cont…
• Derivation of the function can be demonstrated with an example.
• Suppose we want to predict a person’s sex based on the person's height.
• Let's say the probability of being male at a given height is 0.9.
• Odds (P/(1-P)) of being male = 0.9/0.1 = 9
• Odds of being female = 0.1/0.9 = 0.11
• However, the values look asymmetrical.
• This can be corrected by the application of ln.
• ln(9) = 2.197 and ln(1/9) = -2.197
• The overall transformation is the Logit Transformation.
• The log of odds is abbreviated as the Logit.
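The odds and logit values in the height example can be verified with a few lines of Python; the helper names `odds` and `logit` are illustrative.

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """Log of the odds (the logit transformation)."""
    return math.log(odds(p))

p_male = 0.9
p_female = 1 - p_male

# The odds are asymmetrical (9 versus about 0.11) ...
print(round(odds(p_male), 2), round(odds(p_female), 2))
# ... but the logits are symmetrical around zero.
print(round(logit(p_male), 3), round(logit(p_female), 3))
```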
LR Function Cont…
Mathematically:
$\ln\left(\frac{p}{1-p}\right) = \alpha + \beta x$
$\frac{p}{1-p} = e^{\alpha + \beta x}$
$P = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$
$P = \frac{1}{1 + e^{-z}}$, where $z = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
LR Function Cont…
• One of the advantages of Logistic Regression is that it is possible to compute the OR from its coefficient.
• Let’s assume a researcher is interested in studying the effect of smoking as the predictor variable (X) on the dependent variable lung cancer (Y).
– X can be present (X=1) or absent (X=0),
– Y can be present (Y=1) or absent (Y=0),
$\log\left[\frac{P(Y=1)}{1 - P(Y=1)}\right] = \alpha + \beta X$
LR Function Cont…
• Hence;
$\log[\text{odds}(Y=1 \mid X=1)] = \alpha + \beta(1)$
$\log[\text{odds}(Y=1 \mid X=0)] = \alpha + \beta(0)$
• The OR = Odds of smokers ÷ Odds of non-smokers
$OR = \frac{e^{\alpha + \beta(1)}}{e^{\alpha}}$
$OR = e^{\beta}$
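The step from the two log-odds equations to OR = e^β can be checked numerically. In this sketch α = -3.0 and β = 1.2 are hypothetical coefficients for the smoking and lung cancer model, not estimates from any dataset.

```python
import math

# Hypothetical coefficients (assumptions for illustration only).
alpha, beta = -3.0, 1.2

# Odds of lung cancer for smokers (X = 1) and non-smokers (X = 0).
odds_smokers = math.exp(alpha + beta * 1)
odds_non_smokers = math.exp(alpha + beta * 0)

# The intercept cancels in the ratio, leaving exp(beta).
odds_ratio = odds_smokers / odds_non_smokers
print(round(odds_ratio, 3), round(math.exp(beta), 3))
```

Whatever value α takes, the ratio depends only on β, which is why the coefficient of a binary predictor can be reported directly as an odds ratio.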
Assumptions of Logistic Regression
• Logistic Regression has fewer assumptions than Linear Regression:
– The DV need not be normally distributed.
– Normally distributed error terms are not assumed.
– Error terms need not be homoscedastic for each level of the IVs.
Assumptions of LR Cont…
But it has the following assumptions:
1. Data type: A dichotomous or polytomous DV.
2. Inclusion of all relevant variables and exclusion of the irrelevant ones: i.e. based on a scientific framework or a statistical cutoff point (P=0.3).
3. No interaction: LR doesn’t consider interaction effects except when interactions are created as a variable.
4. No outliers and influential cases: Such cases can affect the model significantly.
Assumptions of LR Cont…
5. No multicollinearity: As the IVs increase in correlation with each other, the standard errors become inflated. It can be checked by:
– A standard error > 2.0.
– Examining the correlations and associations b/n IVs
– Tolerance and VIF.
6. No outliers and influential cases: Such cases can affect the model significantly.
7. Large samples:
– The minimum Ratio of Valid Cases to Variables should be at least 10:1. The preferred ratio is 20:1.
Assumptions of LR Cont…
8. Linearity:
– Linear relationship b/n numeric IVs & the logit of the DV.
– If not, the model underestimates the association and lacks power.
– Box-Tidwell Test: If there is non-linearity for a numeric IV X, the [(X)*ln(X)] interaction term becomes significant in the model.
Fitting Logistic Model to a Dataset
• In Linear Regression, the fit of the model to the dataset is achieved through Least Squares Estimation (LSE).
• In Logistic Regression LSE can’t be used.
• In its place Maximum Likelihood Estimation (MLE) is used.
• MLE relies on the concept of Likelihood.
• The likelihood of a set of data is the probability of obtaining that particular set of data, using a given model.
Fitting Logistic Model Cont…
For example:
• Dataset B has five cases. Observed values for Y are (1,0,1,0,1)
• The model predicts the probability of occurrence of Y is 0.7 (i.e. Probability of Y=1 is 0.7, and Y=0 is 0.3)
• Likelihood of B is the joint probability of predicting the correct observed value of Y for every case using the model.
• i.e. L(B) = (0.7)(0.3)(0.7)(0.3)(0.7) = 0.03087
$L(B) = \prod_{i=1}^{n} P_i^{y_i}(1 - P_i)^{1 - y_i}$
Fitting Logistic Model Cont…
• Mathematically it is easier to work with the Log likelihood.
$\ln L(B) = \sum_{i=1}^{n}\left[y_i \ln(P_i) + (1 - y_i)\ln(1 - P_i)\right]$
• Maximum Likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them.
• The MLE of the parameter P is that value of P that maximizes L or ln L.
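The likelihood of Dataset B, and the idea of picking the P that maximizes it, can be sketched as follows. The grid search over candidate values of P is only an illustrative stand-in for the iterative procedure statistical software uses; the function names are assumptions.

```python
import math

# Observed values of Y for Dataset B.
y_observed = [1, 0, 1, 0, 1]

def likelihood(p, y):
    """Joint probability of the observed y values when P(Y=1) = p."""
    return math.prod(p if yi == 1 else 1 - p for yi in y)

def log_likelihood(p, y):
    """Sum of y*ln(p) + (1-y)*ln(1-p) over all cases."""
    return sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in y)

# Likelihood under the model that predicts P(Y=1) = 0.7.
l_model = likelihood(0.7, y_observed)
print(round(l_model, 5))

# Crude maximum likelihood: try P = 0.01, 0.02, ..., 0.99 and keep the best.
candidates = [i / 100 for i in range(1, 100)]
p_mle = max(candidates, key=lambda p: log_likelihood(p, y_observed))
print(p_mle)
```

For this simple one-parameter case the maximizing value coincides with the proportion of 1s in the data (3 out of 5).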
Fitting Logistic Model Cont…
• Iteration: Repeated testing of the data and tuning of the model parameters to provide the best fitting equation.
• Once P is determined, then α and β are estimated.
Interpretation of Reg. Coefficients
$P = \frac{1}{1 + e^{-z}}$, where $z = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
• α is called the Intercept and β1, β2, and so on, are called the Regression Coefficients of x1, x2, …, respectively.
• α is the value of z when the value of all risk factors is zero.
• A +ve coefficient means the risk factor increases the probability of the outcome, while a -ve coefficient means the opposite.
• A large coefficient means that the risk factor strongly influences the probability of the outcome, while a near-zero coefficient means the opposite.
Hypothesis Testing in Logistic Reg.
• In Logistic Regression the t or F test statistics cannot be used for hypothesis testing since the outcome has a Bernoulli distribution.
• Options:
– The (log) Likelihood Ratio Statistic (-2LL),
– The Wald Test,
• All test either of the following null hypotheses:
– Ho: β1 = β2 = β3 = … = βn = 0
– Ho: Removing an IV from the model doesn’t change its predictive ability.
Hypothesis Testing LR Cont….
A. Likelihood Ratio Test Statistic (-2LL):
• Usually two nested models (the Full and Reduced Models) are presented.
• Reduced model means a model from which a variable is purposely omitted.
• Ho: The removed variable is not significant in the model.
• -2 Log L = -2 [Log L Reduced model – Log L Full model]
$LR\ statistic = -2\log\left(\frac{L\ \text{of the reduced model}}{L\ \text{of the full model}}\right)$
Hypothesis Testing LR Cont….
• If the full model explains the data `much better' than the reduced model, the difference will be `large':
Reject the Ho that the removed variable is non-significant.
• If the reduced model explains the data as well as the full model, the difference will be close to 0:
Fail to reject the Ho that the removed variable is non-significant.
• LRT ~ χ² with df = number of removed variables.
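A minimal sketch of the likelihood ratio test for one removed variable is given below. The two log-likelihood values are hypothetical, and the p-value helper is valid only for 1 degree of freedom (it uses the identity between the chi-square distribution with 1 df and the squared standard normal).

```python
import math

def likelihood_ratio_statistic(loglik_reduced, loglik_full):
    """-2 [log L(reduced model) - log L(full model)]."""
    return -2.0 * (loglik_reduced - loglik_full)

def chi2_upper_tail_df1(x):
    """P(chi-square with 1 df > x); valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical log-likelihoods (assumptions): one variable was removed
# from the full model to obtain the reduced model.
loglik_reduced = -120.5
loglik_full = -115.2

lr_stat = likelihood_ratio_statistic(loglik_reduced, loglik_full)
p_value = chi2_upper_tail_df1(lr_stat)
print(round(lr_stat, 2), p_value < 0.05)
```

Here the difference is large relative to a chi-square distribution with 1 df, so the null hypothesis that the removed variable is non-significant would be rejected.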
Hypothesis Testing LR Cont….
B. Wald Statistic:
• Commonly used to test the significance of coefficients for each independent variable.
• Ho: A particular coefficient is zero.
• W ~ χ² with df of 1.
• For a particular IV, if W is significant, then the parameter associated with this variable is not zero, so it should be included in the model.
$\text{Wald test} = \frac{\beta^2}{\text{Variance of } \beta}$
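The Wald statistic is simple enough to compute by hand. The sketch below uses a hypothetical coefficient and variance, and compares the result with 3.841, the 5% critical value of the chi-square distribution with 1 df.

```python
def wald_statistic(beta, variance_of_beta):
    """Wald statistic: squared coefficient divided by its variance."""
    return beta ** 2 / variance_of_beta

# Hypothetical estimate and variance (assumptions for illustration).
beta_hat = 0.8
variance_beta_hat = 0.0625  # i.e. a standard error of 0.25

w = wald_statistic(beta_hat, variance_beta_hat)

# 5% critical value of chi-square with 1 degree of freedom.
CHI2_CRITICAL_DF1 = 3.841
is_significant = w > CHI2_CRITICAL_DF1
print(round(w, 2), is_significant)
```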
Pseudo R-Squares
• In Linear Regression, R² measures the proportion of variance of the DV explained by the predictors.
• Ranges from 0-1
• Logistic Regression doesn’t have an equivalent to the R²
• However, there are varieties of Pseudo R² which are designed to simulate the real R².
• Commonly used: Cox & Snell R² and Nagelkerke R²
• Pseudo R² doesn’t mean exactly what R² means in Linear Regression: Interpretation should be made with caution.
Pseudo R-Squares Cont….
A. Cox and Snell’s Pseudo R²
$R^2 = 1 - \left[\frac{L(M_{Intercept})}{L(M_{Full})}\right]^{2/N}$
B. Nagelkerke Pseudo R²
$R^2 = \frac{1 - \left[\frac{L(M_{Intercept})}{L(M_{Full})}\right]^{2/N}}{1 - L(M_{Intercept})^{2/N}}$
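Both pseudo R² formulas can be evaluated from the log-likelihoods of the intercept-only and full models. Working on the log scale avoids very small likelihood values. The log-likelihoods and sample size below are hypothetical.

```python
import math

def cox_snell_r2(loglik_intercept, loglik_full, n):
    """1 - [L(intercept) / L(full)]^(2/N), computed from log-likelihoods."""
    return 1 - math.exp((2.0 / n) * (loglik_intercept - loglik_full))

def nagelkerke_r2(loglik_intercept, loglik_full, n):
    """Cox and Snell R2 divided by its maximum, 1 - L(intercept)^(2/N)."""
    max_r2 = 1 - math.exp((2.0 / n) * loglik_intercept)
    return cox_snell_r2(loglik_intercept, loglik_full, n) / max_r2

# Hypothetical values (assumptions for illustration only).
loglik_intercept = -68.0
loglik_full = -55.0
n_cases = 100

cs = cox_snell_r2(loglik_intercept, loglik_full, n_cases)
nk = nagelkerke_r2(loglik_intercept, loglik_full, n_cases)
print(round(cs, 3), round(nk, 3))
```

Because the Cox and Snell value cannot reach 1, the Nagelkerke version rescales it, so it is always at least as large.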
Goodness of Fit Analysis
A. Hosmer-Lemeshow Statistic
• The recommended test for overall fit of a Logistic Regression model,
• A type of chi-square test but considered stronger than the traditional chi-square test, particularly if continuous covariates are in the model or the sample size is small.
• The HL statistic first sorts observations in increasing order of their estimated event probability and divides them into deciles based on the predicted probabilities.
• HL statistic ~ χ² with df of 8.
Goodness of Fit Analysis Cont…
$G^2_{HL} = \sum_{j=1}^{10} \frac{(O_j - E_j)^2}{E_j\left(1 - \frac{E_j}{n_j}\right)} \approx \chi^2$ with df of 8
• Where
– n_j is the number of observations in the j-th group
– O_j is the observed number of cases in the j-th group
– E_j is the expected number of cases in the j-th group
• Non-significance means the model adequately fits the data.
• A P value of 0.05 is considered as the level of significance.
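The Hosmer-Lemeshow formula can be sketched as a sum over the ten risk groups. The group sizes, observed counts and expected counts below are hypothetical; 15.507 is the 5% critical value of the chi-square distribution with 8 df.

```python
def hosmer_lemeshow(groups):
    """HL statistic from a list of (n_j, observed_j, expected_j) tuples."""
    return sum(
        (observed - expected) ** 2 / (expected * (1 - expected / n))
        for n, observed, expected in groups
    )

# Hypothetical deciles of risk (assumptions): 50 observations per group.
expected_cases = [2.5, 5, 7.5, 10, 15, 20, 25, 30, 37.5, 45]
observed_cases = [3, 4, 8, 11, 14, 21, 24, 31, 38, 44]
deciles = [(50, o, e) for o, e in zip(observed_cases, expected_cases)]

hl_stat = hosmer_lemeshow(deciles)

# 5% critical value of chi-square with 8 degrees of freedom.
CHI2_CRITICAL_DF8 = 15.507
adequate_fit = hl_stat < CHI2_CRITICAL_DF8
print(round(hl_stat, 3), adequate_fit)
```

A statistic below the critical value corresponds to a non-significant test, which is read as adequate fit.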
Goodness of Fit Analysis Cont…
B. Loglikelihood Statistics
• A good model is the one that results in a high likelihood of the observed results.
• This translates into a small value for -2LL.
• If a model fits perfectly, the -2LL would be 0.
• Since there is no acceptable upper cutoff point for the -2LL test, it is difficult to interpret the meaning of the score.
• Less commonly used.
Logistic Regression Using SPSS
• Analyze > Regression > Binary Logistic > Put the dependent and independent variables > Mark categorical independent variables > check for the options > Ok.
Or
• Analyze > Regression > Multinomial Logistic > Put the dependent variable > Put the independent variables as factors or covariates depending on their nature > check for available options > Ok.
Analysis of Variance (ANOVA)
ANOVA
• Used to compare the mean of a quantitative variable across different categories of a categorical variable.
• The specific type is called One-way ANOVA.
• If two categorical variables (factors) are involved, it is called Two-way ANOVA.
• If the categorical variable has only 2 values, a 2-sample t-test can be used.
• ANOVA allows for comparison among 3 or more groups.
• ANOVA is helpful because it possesses a certain advantage over a two-sample t-test.
• Doing multiple two-sample t-tests would result in a largely increased chance of committing a type I error.
ANOVA Cont…
• ANOVA functions by checking whether the differences between the groups are significant, which depends on:
– The difference in the means
– The standard deviations of each group
– The sample sizes
• ANOVA determines the P-value from the F statistic.
• Hypothesis:
– H0: The means of all the groups are equal.
– H1: Not all the means are equal.
• Doesn’t explain which ones differ.
• Once a global difference is detected, it should be followed up with “multiple comparisons” (post hoc tests) to identify specific differences.
ANOVA Cont…
Assumptions of ANOVA:
• Each group is approximately normally distributed,
• Observed data constitute independent random samples from the respective populations,
• Standard deviations of each group are approximately equal
– Rule of thumb: the ratio of the largest to the smallest sample standard deviation must be less than 2:1
ANOVA Cont…
• ANOVA is a technique whereby the total variation present in a dataset is segregated into several components.
• Variation is the sum of the squares of the deviations between a value and the mean of the value.
• Sum of squares (SS) is another name for variation.
• ANOVA measures two sources of variation in the data and compares their relative sizes.
– Between group variation
– Within group variation
ANOVA Cont…
Between group variation:
• Is there some variation between the groups?
• Sometimes called the variation due to the factor.
• Denoted SS(B) for Sum of Squares (variation) between the groups.
• Calculated as follows (given x double bar is the grand mean):
$SS(B) = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{\bar{x}})^2$
$SS(B) = n_1(\bar{x}_1 - \bar{\bar{x}})^2 + n_2(\bar{x}_2 - \bar{\bar{x}})^2 + \dots + n_k(\bar{x}_k - \bar{\bar{x}})^2$
ANOVA Cont…
Within group variation:
• Is there some variation within the groups?
• Sometimes called the error variation as it is the variation that can’t be explained by the factor.
• Denoted SS(W) for Sum of Squares (variation) within the groups.
• Calculated as follows, given n_i is the sample size and s_i the standard deviation of group i:
$SS(W) = \sum_{i=1}^{k}(n_i - 1)s_i^2$
$SS(W) = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \dots + (n_k - 1)s_k^2$
ANOVA Cont…
Variance:
• Based on the variation (SS), variance is calculated for both categories.
• The variance is also called the Mean of the Squares and abbreviated by MS, often with an accompanying variable: MS(B) or MS(W).
• Calculated by dividing the variation by the df
• MS = SS / df
• The between group df is one less than the number of groups (k-1)
• The within group df is the sum of the individual dfs of each group. In other words, it is (n-k), where n is the total number of observations.
ANOVA Cont…
The F distribution:
• Used as the test of significance in ANOVA.
• The F distribution is defined as the distribution of (Z/n1)/(W/n2), where Z has a chi-square distribution with n1 df, W has a chi-square distribution with n2 df, and Z and W are statistically independent.
• In ANOVA the F test statistic is the ratio of two sample variances (MSB/MSW).
• The df for the numerator are the df for the between group (k-1) and the df for the denominator are the df for the within group (n-k).
• A large F is evidence against H0, since it indicates that there is more difference b/n groups than within groups.
ANOVA Cont…
Example:
• Suppose we have three groups:
– Group 1: 5.3, 6.0, 6.7
– Group 2: 5.5, 6.2, 6.4, 5.7
– Group 3: 7.5, 7.2, 7.9
• Then we compute the ANOVA F statistic in the following manner.
ANOVA Cont…
WITHIN difference: data - group mean
BETWEEN difference: group mean - overall mean

data   group   group mean   WITHIN plain   WITHIN squared   BETWEEN plain   BETWEEN squared
5.3    1       6.00         -0.70          0.490            -0.4            0.194
6.0    1       6.00         0.00           0.000            -0.4            0.194
6.7    1       6.00         0.70           0.490            -0.4            0.194
5.5    2       5.95         -0.45          0.203            -0.5            0.240
6.2    2       5.95         0.25           0.063            -0.5            0.240
6.4    2       5.95         0.45           0.203            -0.5            0.240
5.7    2       5.95         -0.25          0.063            -0.5            0.240
7.5    3       7.53         -0.03          0.001            1.1             1.188
7.2    3       7.53         -0.33          0.109            1.1             1.188
7.9    3       7.53         0.37           0.137            1.1             1.188
TOTAL                                      1.757                            5.106
TOTAL/df                                   0.25095714                       2.55275

overall mean: 6.44
F = 2.5528/0.2510 ≈ 10.17 when the rounded group means above are used; with unrounded means F = 10.21575, as in the ANOVA table below.

ANOVA Cont…
ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        5.127333   2    2.563667   10.21575   0.008394   4.737416
Within Groups         1.756667   7    0.250952
Total                 6.884      9

• Between Groups df: 1 less than the number of groups
• Within Groups df: number of data values - number of groups (equals the df for each group added together)
• Total df: 1 less than the number of individuals (just like other situations)
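The whole calculation for the three groups in the example can be reproduced with a short Python sketch. It uses unrounded group means, so it should agree with the ANOVA table rather than with the hand-rounded working table; the variable names are illustrative.

```python
# The three groups from the example.
groups = [
    [5.3, 6.0, 6.7],
    [5.5, 6.2, 6.4, 5.7],
    [7.5, 7.2, 7.9],
]

def mean(values):
    return sum(values) / len(values)

all_values = [v for group in groups for v in group]
grand_mean = mean(all_values)
k = len(groups)       # number of groups
n = len(all_values)   # total number of observations

# SS(B): group size times squared distance of group mean from grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
# SS(W): squared distance of each value from its own group mean.
ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

ms_between = ss_between / (k - 1)   # df = k - 1
ms_within = ss_within / (n - k)     # df = n - k
f_statistic = ms_between / ms_within

print(round(ss_between, 6), round(ss_within, 6), round(f_statistic, 5))
```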
ANOVA Using SPSS
• Analyze > Compare means > One way ANOVA > Put the continuous variable under “Dependent list” > Put the categorical variable under “Factor” > Select “Post hoc” tests > Ok.
Thank You