DESCRIPTIVE STATISTICS
Dr Najib Majdi bin Yaacob
MD, MPH, DrPH (Epidemiology) USM
Unit of Biostatistics & Research Methodology
School of Medical Sciences
Universiti Sains Malaysia.
Content
• Introduction to statistics
• Descriptive vs. inferential statistics
• Variables
• Types of variables
• Organizing and displaying data for categorical variables
• Organizing and displaying data for numerical variables
INTRODUCTION TO STATISTICS
DATA VS. STATISTIC VS. STATISTICS
Data & Statistic
• Data: A collection of items of information.
• Statistic: A summary value of some attribute of a sample, usually but not necessarily used as an estimator of some population parameter. It is calculated by applying a function to the values of the items of the sample.
(Porta, M. (2014). A Dictionary of Epidemiology: Oxford University Press, USA)
Statistics
• The science of collecting, summarizing, and analyzing data. Data may or may not be subject to random variation.
• The data themselves and summarizations of the data.
Porta, M. (2008). A Dictionary of Epidemiology: Oxford University Press, USA
• A branch of applied mathematics concerned with the collection and interpretation of quantitative data and the use of probability theory to estimate population parameters.
http://www.thefreedictionary.com
Example:
• Data:
ID Gender Height (m)
1 Male 1.67
2 Male 1.73
3 Female 1.61
4 Male 1.63
5 Female 1.57
6 Female 1.62
7 Female 1.53
Example:
• Statistic:
– 4 (57.1%) Female, 3 (42.9%) Male
– Mean height = 1.62 m
– Standard deviation for height = 0.06 m
• Statistics:
– The process of calculating the statistic: how to calculate the frequency and percentage for gender, and how to calculate the mean and standard deviation for height.
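The step from data to statistic in this example can be sketched in a few lines of Python. This is only an illustration using the standard library; the variable names are assumptions, and the sample standard deviation (n - 1 denominator) is assumed because it reproduces the 0.06 m quoted above.

```python
from collections import Counter
from statistics import mean, stdev

# Data from the example table: (ID, gender, height in metres)
records = [
    (1, "Male", 1.67), (2, "Male", 1.73), (3, "Female", 1.61),
    (4, "Male", 1.63), (5, "Female", 1.57), (6, "Female", 1.62),
    (7, "Female", 1.53),
]

genders = [gender for _, gender, _ in records]
heights = [height for _, _, height in records]

# Frequency and percentage for the categorical variable
counts = Counter(genders)
percentages = {g: 100 * c / len(genders) for g, c in counts.items()}

# Mean and sample standard deviation for the numerical variable
mean_height = mean(heights)
sd_height = stdev(heights)  # n - 1 denominator

for gender, count in counts.items():
    print(f"{count} ({percentages[gender]:.1f}%) {gender}")
print(f"Mean height = {mean_height:.2f} m")
print(f"Standard deviation for height = {sd_height:.2f} m")
```

The data are the seven rows; the printed values are the statistics; the code itself plays the role of "statistics" as a process.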
Why use statistics?
• Modern society is concerned with reading & writing
• Statistics is used to make the strongest possible conclusions from a limited amount of data.
• A more thorough understanding of the research literature leads to improved patient care.
DESCRIPTIVE VS. INFERENTIAL BRANCHES OF STATISTICS
Descriptive statistics
• Describe and summarize a dataset
• Involves collection, organization, analysis, interpretation and presentation of sample data
• Can be presented in tables, graphs or narrative format
Descriptive statistics
• Purpose
– Describe the characteristics of study participants
– Understand the data
– Answer the research questions in descriptive study
– Detect outliers or extreme values
How to describe this population?
[Figure: samples are taken from the population, and descriptive statistics describe the samples]
Descriptive statistics
– Frequency distribution
– Measures of central tendency
– Measures of dispersion
– Measures of position
– Exploratory data analysis
– Measures of shape of distribution: graphs, skewness, kurtosis
Inferential statistics
– Estimation
– Hypothesis testing to reach a decision
– Parametric statistics
– Non-parametric statistics (distribution-free statistics)
– Modelling, predicting
How to make conclusions from this population?
[Figure: samples are taken from the population, and inferential statistics infer the findings back to the population]
VARIABLE
Variables
• Any quantity that can have different values across individuals or other study units. (Porta, M. (2014). A Dictionary of Epidemiology: Oxford University Press, USA)
• Variables:
– Independent
– Dependent
Variables
• Independent variable
– A variable that is hypothesized to influence an event or state (the dependent variable)
– The independent variable is not influenced by the event but may cause (or contribute to the occurrence of) the event, or contribute to change the (psychological, environmental, socioeconomic) status.
Variables
• Dependent variable
– A variable the value of which is dependent on the effect of another variable(s) – the independent variable(s) – in the relationship under study.
– A manifestation or outcome whose variation we seek to explain or account for by the influence of independent variables.
Variables
Effect of sunlight on plant growth
[Figure: sunlight is the independent variable and plant growth is the dependent variable. When plotted, the X axis shows the independent variable and the Y axis shows the dependent variable.]
Variables
• Controlled variable(s)
– Everything you want to remain constant and unchanged during the study period
– Example: Investigating the effect of sunlight exposure duration (hours/day) on plant growth
• Independent variable: Duration of sunlight exposure
• Dependent variable: Plant height
• Controlled variables: type of plant, size of pot, amount of water, type of soil, etc.
TYPES OF VARIABLES
MEASUREMENT SCALE
Measurement scale
• Different types of scale are measured differently
• Knowledge about the measurement scale/data helps in deciding how to organize, analyse and present the data.
• Four fundamental scales:
– Nominal
– Ordinal
– Interval
– Ratio
Classification of data
Data
• Categorical (qualitative): Nominal, Ordinal
• Numerical (quantitative): Interval, Ratio
Information increases from nominal (less info) to ratio (more info).
Categorical data: Nominal scales
• Names or categories, mutually exclusive
• Does not imply any ordering of responses
• Example;
– Sex: Male, Female
– Race: Malay, Chinese, Indian, Others
• Lowest and least informative level of measurement
Categorical data: Ordinal scales
• Names or categories that are mutually exclusive and whose order is meaningful
• Example:
– Severity: mild, moderate, severe
– Socioeconomic status: Low, Middle, High
• Limitation:
– Can’t assume the differences between adjacent scale values are equal
– Can’t make this assumption even if the labels are numbers
Numerical data: Interval scales
• Names or categories; the order is meaningful and the intervals are equal.
• Example;
– Fahrenheit temperature scale
– Celsius temperature scale
• Problem:
– No true zero point (Zero point is arbitrary)
Interval scales
Zero does not mean complete absence of temperature
Numerical data: Ratio scales
• Highest and most informative scale
• Contains the qualities of the nominal, ordinal and interval scale with the addition of an absolute zero point.
• Example:
– Amount of money
– Age
– Blood pressure
• The values can be multiplied or divided
Ratio scales
Zero on the Kelvin scale is the absolute absence of thermal energy. The Kelvin scale is therefore considered a ratio scale.
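A small numerical sketch may make the interval/ratio distinction concrete. The two temperatures below are hypothetical values chosen for illustration; the only fact relied on is the standard conversion K = °C + 273.15.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15

# Hypothetical readings, chosen only for illustration
c_low, c_high = 10.0, 20.0

# Differences are meaningful on an interval scale and agree across scales
diff_celsius = c_high - c_low
diff_kelvin = celsius_to_kelvin(c_high) - celsius_to_kelvin(c_low)

# Ratios depend on where zero sits, so only the Kelvin ratio is meaningful
ratio_celsius = c_high / c_low
ratio_kelvin = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(f"Difference: {diff_celsius:.1f} degrees C, {diff_kelvin:.1f} K")
print(f"Ratio on Celsius scale: {ratio_celsius:.2f}")
print(f"Ratio on Kelvin scale:  {ratio_kelvin:.2f}")
```

The difference is 10 on both scales, but 20 °C is not "twice as hot" as 10 °C: the ratio computed on the Kelvin scale is only about 1.04.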
Numerical data
• Interval and ratio variables are sometimes indistinguishable, and are handled the same way in data analysis.
• Both can be converted to categorical data
• Converting numerical to categorical data causes loss of information
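The loss of information can be seen in a short sketch. The nine heights and the 170 cm cut-off are assumptions chosen for illustration, not a recommended grouping.

```python
from collections import Counter

# Nine height values in cm, used here only for illustration
heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]

def height_group(height, cutoff=170):
    """Collapse a numerical height into one of two categories.

    The 170 cm cut-off is a hypothetical choice.
    """
    return "below 170 cm" if height < cutoff else "170 cm or above"

groups = [height_group(h) for h in heights_cm]
group_counts = Counter(groups)

for label, count in group_counts.items():
    print(f"{label}: {count}")
```

After grouping, 162 cm and 169 cm carry the same label, so the original values (and statistics such as the mean) can no longer be recovered from the categories.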
Summary of data types and scale measurement
Provides                                        Nominal    Ordinal  Interval  Ratio
Counts/frequency of distribution                Yes        Yes      Yes       Yes
Mode, median                                    Mode only  Yes      Yes       Yes
The “order” of values is known                  No         Yes      Yes       Yes
Can quantify the difference between each value  No         No       Yes       Yes
Can add or subtract values                      No         No       Yes       Yes
Can multiply and divide values                  No         No       No        Yes
Has “true zero”                                 No         No       No        Yes
ORGANIZING & DISPLAYING DATA FOR CATEGORICAL VARIABLE
Organizing & displaying data for categorical variable
• Table:
– Frequency table
• Frequency
• Relative frequency (percentage)
• Cumulative frequency (cumulative percentage)
• Graphical:
– Bar chart
– Pie chart
Frequency table
[Figure: frequency table output from SPSS]
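The three columns of a frequency table can be built by hand as a rough sketch. The severity ratings below are hypothetical data invented for illustration, and the layout only loosely resembles what a statistics package prints.

```python
from collections import Counter

# Hypothetical severity ratings for 10 patients (illustration only)
severity = ["mild", "severe", "moderate", "mild", "mild",
            "moderate", "severe", "mild", "moderate", "mild"]
category_order = ["mild", "moderate", "severe"]  # ordinal order

counts = Counter(severity)
n = len(severity)

# Each row: (category, frequency, percent, cumulative percent)
table = []
cumulative = 0
for category in category_order:
    frequency = counts[category]
    cumulative += frequency
    table.append((category, frequency, 100 * frequency / n, 100 * cumulative / n))

print(f"{'Category':<10}{'Frequency':>10}{'Percent':>10}{'Cum. %':>10}")
for category, frequency, percent, cum_percent in table:
    print(f"{category:<10}{frequency:>10}{percent:>10.1f}{cum_percent:>10.1f}")
```

Cumulative percentages are most meaningful when the categories have a natural order, as they do for an ordinal variable such as severity.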
Bar chart
Characteristics:
1. Y axis represents frequency
2. X axis represents categorical variables
3. Equal width of bars
4. Bars separated by equal gaps
5. Height represents frequency or percent
Pie chart
Characteristics:
1. Size of slice represents frequency or percent
2. Each slice represents one category
3. Combination of all slices must add up to 100%
Excellent graphical presentation of data
• Accuracy: proper data entry; not misleading, distorted or susceptible to misinterpretation
• Clarity: The ideas and concepts conveyed are clearly understood
• Simplicity: Straightforward, avoid gridlines or odd lettering
• Appearance: should be appealing
• Well-designed structure: pattern highlighted, letterings are horizontal
ORGANIZING & DISPLAYING DATA FOR NUMERICAL DATA
Organizing & displaying data for numerical data
• Central tendency
• Dispersion
• Exploratory data analysis
1. Stem & leaf displays
2. Box and whisker plots
• Frequency
1. Histogram
2. Frequency polygon
3. Cumulative frequency
• Shape of distribution
Measures of central tendency
1. Mean
2. Median
3. Mode
Measures of central tendency
1. Mean
– Sample average
– Sum all values, divided by the number of values
– Sensitive to extreme values
Example:
What is the mean height of these 9 students?
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
id          1   2   3   4   5   6   7   8   9
height (cm) 167 176 174 172 170 162 171 171 169
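As a worked check of this example, the mean can be computed directly from its definition. The sketch below only restates "sum all values, divided by the number of values"; the variable names are illustrative.

```python
# Heights (cm) of the 9 students in the example
heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]

n = len(heights_cm)          # number of values
total = sum(heights_cm)      # sum of all values
mean_height = total / n      # sample mean

print(f"Mean height = {total} / {n} = {mean_height:.1f} cm")
```

The sum is 1532 cm over 9 students, giving a mean of about 170.2 cm.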
Measures of central tendency
2. Median
– Middle value
– Not sensitive to extreme value
– Used to summarize skewed data
– When n is odd, median=[(n+1)/2]th value
– When n is even, median=average of (n/2)th and [(n/2)+1]th value
Measures of central tendency
2. Median
Example:
What is the median height of these 9 students?
id          1   2   3   4   5   6   7   8   9
height (cm) 167 176 174 172 170 162 171 171 169
n = 9, median = [(9+1)/2]th value = 5th value = 171 cm
rank                 1   2   3   4   5   6   7   8   9
sorted height (cm)   162 167 169 170 171 171 172 174 176
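The rank rule for odd and even n can be written out as a short function. This is a sketch of the rule as stated in these slides; the function name is an assumption, and the value 250 used to show robustness to extreme values is hypothetical.

```python
def median_by_rank(values):
    """Median using the [(n+1)/2]th value rule (odd n) or the
    average of the (n/2)th and [(n/2)+1]th values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]   # ranks are 1-based, indexes 0-based
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]
median_height = median_by_rank(heights_cm)

# Hypothetical extreme value: replace the tallest student with 250 cm
with_extreme = [250 if h == 176 else h for h in heights_cm]
median_with_extreme = median_by_rank(with_extreme)
mean_with_extreme = sum(with_extreme) / len(with_extreme)

print(f"Median = {median_height} cm")
print(f"With an extreme value: median = {median_with_extreme} cm, "
      f"mean = {mean_with_extreme:.1f} cm")
```

The median stays at 171 cm when one value is made extreme, while the mean moves from about 170.2 cm to about 178.4 cm.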
Measures of central tendency
3. Mode
– The observation that occurs most frequently
– Less useful in describing data
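A brief sketch using `statistics.multimode` from the Python standard library shows both the mode and one reason it is less useful: a dataset may have several modes. The second list is hypothetical data invented for illustration.

```python
from statistics import multimode

# Heights (cm) of the 9 students used in the earlier examples
heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]
height_modes = multimode(heights_cm)

# Hypothetical data with two equally frequent values
bimodal = [1, 1, 2, 2, 3]
bimodal_modes = multimode(bimodal)

print(f"Mode(s) of height: {height_modes}")
print(f"Mode(s) of hypothetical data: {bimodal_modes}")
```

Here 171 cm occurs twice and is the only mode of the heights, whereas the hypothetical data have two modes and no single "most typical" value.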
Measures of dispersion
1. Range
2. Variance
3. Standard deviation
4. Coefficient of variation
5. Inter quartile range
Measures of dispersion
1. Range
– Largest value – smallest value (max-min)
– Sensitive to extreme values
Measures of dispersion
2. Variance
– Measures the amount of spread or variability of observations from the mean
– The sample variance (s²) = the average of the squared deviations about the sample mean (population variance = σ²)
– Rarely reported in descriptive statistics because a squared unit of data is difficult to interpret.
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Measures of dispersion
3. Standard deviation
– Square root of variance
– Most widely used and better measure of variability
– The smaller the value, the closer the observations lie to the mean
– Sensitive to extreme values
$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
Measures of dispersion
4. Coefficient of variation
Ratio of the standard deviation to the mean
Expressed as percentage
Also known as relative standard deviation
Shows the extent of variability in relation to the mean.
$$CoV = \frac{s}{\bar{X}}$$
Hands-on
• Calculate/find the range, variance, standard deviation and coefficient of variation for numerical variables in the given data file.
(5 minutes)
id          1   2   3   4   5   6   7   8   9
height (cm) 167 176 174 172 170 162 171 171 169
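One way to check a hand calculation for this exercise is with Python's `statistics` module. This is a sketch, not the method used in the course (which refers to SPSS); `variance` and `stdev` use the n - 1 denominator, matching the sample formulas in these slides.

```python
from statistics import mean, stdev, variance

# Heights (cm) of the 9 students in the hands-on table
heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]

value_range = max(heights_cm) - min(heights_cm)   # max - min
sample_variance = variance(heights_cm)            # squared units (cm^2)
sample_sd = stdev(heights_cm)                     # square root of the variance
cov_percent = 100 * sample_sd / mean(heights_cm)  # SD relative to the mean, in %

print(f"Range = {value_range} cm")
print(f"Variance = {sample_variance:.2f} cm^2")
print(f"Standard deviation = {sample_sd:.2f} cm")
print(f"Coefficient of variation = {cov_percent:.2f}%")
```

For these heights the range is 14 cm, the variance about 16.44 cm², the standard deviation about 4.06 cm and the coefficient of variation about 2.38%.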
Measures of dispersion
5. Interquartile range (IQR):
Data can be divided into quarters (four equal parts):
Q1=25th percentile
Q2=50th percentile
Q3=75th percentile
IQR is the distance from Q1 to Q3
[Figure: data from Min to Max divided into quarters at Q1, Q2 and Q3]
Measures of dispersion
5. Interquartile range (IQR):
The most common inter-percentile measure
Not sensitive to extreme values (outliers)
Usually reported together with the median for skewed distributions
[Figure: SPSS output]
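Quartiles can be computed in more than one way, and different conventions give slightly different answers for small samples. The sketch below uses `statistics.quantiles` from the Python standard library with both of its methods; which convention a given package reports is not stated in these slides, so treat the comparison as an illustration only.

```python
from statistics import quantiles

# Heights (cm) of the 9 students used in the earlier examples
heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]

# "exclusive": percentile positions based on (n + 1)
q1_ex, q2_ex, q3_ex = quantiles(heights_cm, n=4, method="exclusive")
iqr_ex = q3_ex - q1_ex

# "inclusive": treats the minimum and maximum as the 0th and 100th percentiles
q1_in, q2_in, q3_in = quantiles(heights_cm, n=4, method="inclusive")
iqr_in = q3_in - q1_in

print(f"Exclusive: Q1={q1_ex}, Q2={q2_ex}, Q3={q3_ex}, IQR={iqr_ex}")
print(f"Inclusive: Q1={q1_in}, Q2={q2_in}, Q3={q3_in}, IQR={iqr_in}")
```

Both methods agree on the median (171 cm), but the IQR is 5 cm under one convention and 3 cm under the other, so it is worth stating which method was used when reporting quartiles.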
GRAPHICAL VISUALIZATION/ PRESENTATION FOR NUMERICAL DATA
Exploratory data analysis
1. Stem & leaf displays
2. Box and whisker plots
Exploratory data analysis
Stem & leaf displays
Allows easier identification of individual values in the sample
height Stem-and-Leaf Plot
Frequency Stem & Leaf
     1.00 Extremes    (=<162)
     2.00       16 .  79
     5.00       17 .  01124
     1.00       17 .  6
Stem width: 10
Each leaf: 1 case(s)
id          1   2   3   4   5   6   7   8   9
height (cm) 167 176 174 172 170 162 171 171 169
Exploratory data analysis
Box and whisker plots
• Graphical display of percentiles
• Also known as the 5-number summary plot (min, Q1, Q2, Q3, max)
• Provides information on central tendency and variability of the middle 50% of the distribution
• The box represents the 25th to 75th percentiles
Exploratory data analysis
Box and whisker plots
• Observations more than 1.5 times the IQR away from the edge of the box are outliers
• Observations more than 3 times the IQR away from the edge of the box are extreme outliers
• Whiskers extend to the smallest and largest values that are not outliers
• Continuous data in multiple groups can be displayed side by side
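The outlier rules above can be sketched as a small function. The function name and the returned dictionary are assumptions made for illustration; quartiles are taken with the "inclusive" method of `statistics.quantiles`, which is only one of several conventions and is an assumption here.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Classify values using the 1.5 x IQR and 3 x IQR rules."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")  # assumed convention
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    extreme = [x for x in ordered if x < outer_low or x > outer_high]
    outliers = [x for x in ordered
                if (x < inner_low or x > inner_high) and x not in extreme]
    inside = [x for x in ordered if inner_low <= x <= inner_high]
    return {
        "box": (q1, q2, q3),
        "whiskers": (min(inside), max(inside)),  # most distant non-outlier values
        "outliers": outliers,
        "extreme_outliers": extreme,
    }

heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]
summary = box_plot_summary(heights_cm)
print(summary)
```

With this convention the box runs from 169 to 172 cm, the whiskers reach 167 and 176 cm, and 162 cm is flagged as an outlier but not an extreme outlier, which is consistent with the "Extremes (=<162)" row in the stem-and-leaf output.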
[Figures: example box and whisker plots]
Measures of frequency of distribution: Graphs
1. Histogram
2. Frequency polygon
3. Cumulative frequency
Measures of frequency of distribution: Graphs
Histogram
Graphical representation of the frequency distribution of a variable.
Bar height represents frequency or percent
Bar width represents the class interval
No gaps between the class intervals
Gives an idea of the shape of the distribution: normal or skewed
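Behind every histogram is a count of observations per class interval. The sketch below groups the nine example heights into classes and prints a rough text histogram; the class width of 5 cm and the starting point of 160 cm are hypothetical choices.

```python
# Heights (cm) of the 9 students used in the earlier examples
heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]

class_width = 5   # hypothetical class width
start = 160       # hypothetical lower limit of the first class

# Count how many observations fall in each class interval
class_counts = {}
for height in heights_cm:
    lower = start + ((height - start) // class_width) * class_width
    class_counts[lower] = class_counts.get(lower, 0) + 1

# Text histogram: one '#' per observation, classes in increasing order
for lower in sorted(class_counts):
    upper = lower + class_width - 1
    print(f"{lower}-{upper} cm | {'#' * class_counts[lower]}")
```

Changing the class width changes the picture, so the same data can look more or less symmetric depending on how the intervals are chosen.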
[Figure: histogram]
Measures of frequency of distribution: Graphs
Frequency polygon
• A graph that displays the data using lines to connect points plotted for the frequency
• The frequencies correspond to the heights of the vertical bars in the histogram
[Figure: frequency polygon]
Measures of frequency of distribution: Graphs
Cumulative frequency
Used to determine the number of observations that lie below or above a particular value
Calculated using a frequency distribution table
Can be constructed from stem and leaf plots or directly from data
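Working directly from the data, cumulative frequencies can be sketched as follows. The helper names are assumptions, and the threshold of 170 cm is an arbitrary value chosen for illustration.

```python
from bisect import bisect_right
from collections import Counter
from itertools import accumulate

# Heights (cm) of the 9 students used in the earlier examples
heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]
ordered = sorted(heights_cm)

def count_at_or_below(value):
    """Number of observations less than or equal to value."""
    return bisect_right(ordered, value)

def count_above(value):
    """Number of observations greater than value."""
    return len(ordered) - bisect_right(ordered, value)

# Cumulative frequency for each distinct height
frequency = Counter(ordered)
distinct = sorted(frequency)
cumulative = list(accumulate(frequency[v] for v in distinct))

for value, cum in zip(distinct, cumulative):
    print(f"{value} cm: cumulative frequency {cum}")
print(f"At or below 170 cm: {count_at_or_below(170)}")
print(f"Above 170 cm: {count_above(170)}")
```

Four of the nine students are 170 cm or shorter and five are taller, and the last cumulative frequency always equals the sample size.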
[Figure: cumulative frequency plot]
Measures of shape of distribution
• Skewness
• Kurtosis
Measures of shape of distribution
• Skewness:
– Measure of the asymmetry of a distribution around its mean
– Graphically examined by plotting a normal curve on a histogram
– Negative skewness: left tail is more pronounced than the right tail
– Positive skewness: right tail is more pronounced than the left tail
[Figure: skewness]
Measures of shape of distribution
• Kurtosis:
– Relative peakedness or flatness of a distribution compared with the normal distribution
– Visualised by plotting a normal curve on a histogram
– Types:
• Distribution with a high peak: leptokurtic
• Distribution with a flat-topped curve: platykurtic
• Normal distribution: mesokurtic
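Skewness and kurtosis can also be expressed as numbers. The sketch below uses the simple moment-based definitions (third and fourth central moments scaled by the variance), which is an assumption: statistical packages often report bias-adjusted versions, so their values will differ somewhat, especially for small samples. The second dataset is hypothetical.

```python
from statistics import fmean

def central_moment(values, k):
    """k-th central moment: mean of (x - mean)^k."""
    m = fmean(values)
    return fmean([(x - m) ** k for x in values])

def skewness(values):
    """Moment-based skewness: negative = longer left tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0.
    Positive = leptokurtic, negative = platykurtic."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

heights_cm = [167, 176, 174, 172, 170, 162, 171, 171, 169]
flat_data = [1, 2, 3, 4, 5]  # hypothetical, evenly spread values

print(f"Heights: skewness = {skewness(heights_cm):.2f}, "
      f"excess kurtosis = {excess_kurtosis(heights_cm):.2f}")
print(f"Flat data: skewness = {skewness(flat_data):.2f}, "
      f"excess kurtosis = {excess_kurtosis(flat_data):.2f}")
```

The heights give a negative skewness because the single low value (162 cm) stretches the left tail, while the evenly spread hypothetical data are symmetric (skewness 0) and flatter than a normal curve (negative excess kurtosis).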
[Figure: kurtosis]
HOW TO PRESENT
General rule
• Can be presented in graphical, table or text format
• Categorical variable:
– n (%)
• Numerical variable:
– Symmetric data: mean (standard deviation)
– Skewed data: median (IQR)
How to decide symmetric or skewed?
• Statistical
– Mean = median = mode
– Skewness
– Kurtosis
– Kolmogorov-Smirnov test (p>0.05)
– Shapiro-Wilk test (p>0.05)
How to decide symmetric or skewed?
• Graphical
– Histogram
– Stem and Leaf plot
– Box and whisker plot
Table presentation
Table 1: Characteristics of study participants (n=30)
Variable        Mean (SD)    n (%)
Age (yrs)
Sex
  Female
  Male
Race
  Malay
  Chinese
  Indian
Education
  Primary
  Secondary
  Tertiary
BMI (kg/m2)
DBP (mmHg)
SBP (mmHg)
*median (IQR)
THANK YOU.