CHAPTER 1: WHY STUDY STATISTICS?
Why Study Statistics?
• A population is a large (or infinite) set of elements that is of interest for a research question. A parameter is a specific characteristic of a population.
  - All the men living in Turkey can be a population. The average height of these men is then a population parameter.
• A sample is a subset of the population that we use to draw conclusions or make predictions about the parameters of the population (for inferences to be valid, sampling should be random). A statistic is a characteristic of the sample.
  - Instead of measuring the height of every man in Turkey, we can randomly select 5000 men from different locations in the country. This would be our sample. We can then compute the average height of these men to estimate the average height of men in Turkey. This average would be our sample statistic.
Ozan Eksi, TOBB-ETU
Types of Statistics
• Inferential Statistics: This is what was explained above, i.e. using sample data for estimation and hypothesis testing (the tools that help us make statements and decisions under uncertainty or incomplete information).
• Descriptive Statistics: Graphical and numerical procedures that are used to present and summarize data. We can use descriptive statistics on either population or sample data.
Chapter Summary
• Terms reviewed in this chapter:
  - Population (Populasyon)
  - Parameter (Parametre)
  - Sample (Örneklem)
  - Inferential Statistics (Çıkarımsal İstatistik)
  - Estimation (Tahmin)
  - Descriptive Statistics (Betimleyici İstatistik)
CHAPTER 2: USING GRAPHS TO DESCRIBE DATA
Data, Variable, and Constant
Data are usually a set of numbers representing the same kind of thing, such as body weight. That "thing" is called a variable (it is variable because the numbers vary from subject to subject). If the numbers are all the same, the thing is called a constant.
Classification of Variables
• Categorical (sometimes called Nominal) or Numerical
  - Categorical: (Yes or No), (Like, Dislike or Indifferent), ...
  - Numerical: (Discrete: outcome of a die roll, ...), (Continuous: height, time, ...)
• Qualitative or Quantitative
  - Qualitative: These variables are measured on a nominal or ordinal scale. On a nominal scale, numbers are assigned only to label the categories (for example, No and Yes can be labeled as 0 and 1). On an ordinal scale, the numbers also indicate the rank ordering of the items (Like, Indifferent and Dislike can be labeled as 2, 1, 0). Data of this type can be recorded either as categories or as numeric codes.
  - Quantitative: These variables are measured on an interval or ratio scale. Hence the numeric values themselves matter.
• Independent or Dependent
  - Independent: A variable that stands alone and isn't changed by the other variables (e.g. someone's age)
  - Dependent: A variable that is explained by the independent variables
Tables And Graphs to Describe Categorical Variables
• The Frequency Distribution Table shows the number of occurrences (frequency) of each possible outcome
  - Dividing each frequency by the total number of observations gives the relative frequency distribution, which serves as an empirical probability distribution
• Bar Chart, Pie Chart and Pareto Diagram are graphs that present the same information as the Frequency Distribution Table
  - Example: Hospital Patients by Unit

Frequency Distribution Table

Hospital Unit      Number of Patients
Cardiac Care       1,052
Emergency          2,245
Intensive Care       340
Maternity            552
Surgery            4,630

[Figure: Bar chart, "Hospital Patients by Unit". Horizontal axis: hospital unit; vertical axis: number of patients per year, 0 to 5000.]

[Figure: Pie chart, "Hospital Patients by Unit". Cardiac Care 12%, Emergency 25%, Intensive Care 4%, Maternity 6%, Surgery 53%.]
• Pareto Diagram: A special bar chart. Unlike ordinary bar and pie charts, a Pareto diagram presents the categories in order of frequency (descending or ascending), and a line represents the cumulative total
  - Ex: 400 defective items are examined for the cause of the defect

Frequency Distribution Table

Source of Manufacturing Error    Number of defects
Bad Weld                          34
Poor Alignment                   223
Missing Part                      25
Paint Flaw                        78
Electrical Short                  19
Cracked case                      21
Total                            400

Arranging Data

Source of Manufacturing Error    Number of defects    % of Total Defects
Poor Alignment                   223                  55.75
Paint Flaw                        78                  19.50
Bad Weld                          34                   8.50
Missing Part                      25                   6.25
Cracked case                      21                   5.25
Electrical Short                  19                   4.75
Total                            400                 100

[Figure: Pareto Diagram: Cause of Manufacturing Defect. Bars show the % of defects in each category (left axis, 0% to 60%) in the order Poor Alignment, Paint Flaw, Bad Weld, Missing Part, Cracked case, Electrical Short; the line shows the cumulative % (right axis, 0% to 100%).]
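The ordering and cumulative percentages behind a Pareto diagram can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `pareto_table` is an assumption made for this example, and the counts are the defect counts from the example above.

```python
# Defect counts from the manufacturing example (400 defective items in total).
defects = {
    "Bad Weld": 34,
    "Poor Alignment": 223,
    "Missing Part": 25,
    "Paint Flaw": 78,
    "Electrical Short": 19,
    "Cracked case": 21,
}


def pareto_table(counts):
    """Return rows (category, count, percent, cumulative percent),
    sorted by descending count (hypothetical helper for illustration)."""
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for category, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        cumulative += count
        rows.append((category, count, 100 * count / total, 100 * cumulative / total))
    return rows


for category, count, pct, cum_pct in pareto_table(defects):
    print(f"{category:<18}{count:>5}{pct:>8.2f}{cum_pct:>9.2f}")
```

The bars of the diagram would use the percent column and the line would use the cumulative percent column.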
Tables And Graphs to Describe Numerical Variables
• We have a frequency distribution just as in the case of categorical variables. However, since this time the data are not already categorized into groups, it is better to form artificial groups (class intervals) than to report the frequency of each individual data point
  - Ex: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature: 24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
The Ordered Data
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
The Frequency Distribution Table
Interval               Frequency   Relative Frequency   Percentage
10 but less than 20    3           .15                  15
20 but less than 30    6           .30                  30
30 but less than 40    5           .25                  25
40 but less than 50    4           .20                  20
50 but less than 60    2           .10                  10
Total                  20          1.00                 100
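The table can be reproduced with a short Python sketch. The helper name `frequency_table` and its parameters are assumptions made for this illustration; each class is treated as the half-open interval "lower bound but less than upper bound".

```python
# Daily high temperatures from the insulation example.
temperatures = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
                32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def frequency_table(values, start, width, n_classes):
    """Count the values in each interval [lower, lower + width).

    Hypothetical helper: values outside the covered range are ignored.
    Returns rows (lower, upper, frequency, relative frequency).
    """
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_classes:
            counts[index] += 1
    n = len(values)
    return [(start + i * width, start + (i + 1) * width, c, c / n)
            for i, c in enumerate(counts)]


table = frequency_table(temperatures, start=10, width=10, n_classes=5)
for lower, upper, freq, rel in table:
    print(f"{lower} but less than {upper}: {freq:>2}  {rel:.2f}  {100 * rel:.0f}%")
```

Changing `width` shows how the choice of interval affects the shape of the table.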
• Note: In this example we used intervals of width 10 to classify the data. However, there is no fixed rule for this. The decision should be case specific
• Note: The best graph is always the one that displays the information in the clearest and most comprehensible way. There is no restriction on the type of graph that you may use. However, remember that it may be risky not to use standard graphs, as this may confuse readers
Histogram
• It is a graph of the (numerical) data in a frequency distribution
Interval               Frequency
10 but less than 20    3
20 but less than 30    6
30 but less than 40    5
40 but less than 50    4
50 but less than 60    2

[Figure: Histogram: Daily High Temperature. Horizontal axis: temperature, 0 to 60 in steps of 10; vertical axis: frequency; bar heights 3, 6, 5, 4, 2 for the five intervals.]
The Cumulative Frequency Distribution & Ogive (graphing cumulative frequencies)
Interval               Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
10 but less than 20    3           15           3                      15
20 but less than 30    6           30           9                      45
30 but less than 40    5           25           14                     70
40 but less than 50    4           20           18                     90
50 but less than 60    2           10           20                     100
Total                  20          100

[Figure: Ogive: Daily High Temperature. Horizontal axis: temperature, 10 to 60; vertical axis: cumulative percentage, 0 to 100.]
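The cumulative columns are running totals of the frequency column. A minimal Python sketch, assuming the class frequencies of the temperature example (the variable names are chosen for this illustration only):

```python
from itertools import accumulate

# Upper class boundaries and class frequencies of the temperature example.
upper_bounds = [20, 30, 40, 50, 60]
frequencies = [3, 6, 5, 4, 2]

# Running totals give the cumulative frequencies.
cumulative_freq = list(accumulate(frequencies))
total = cumulative_freq[-1]
cumulative_pct = [100 * c / total for c in cumulative_freq]

for bound, cf, cp in zip(upper_bounds, cumulative_freq, cumulative_pct):
    print(f"less than {bound}: {cf:>2}  {cp:.0f}%")
```

Plotting `cumulative_pct` against `upper_bounds` gives the points of the ogive.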
A line chart (time-series plot)
• It is used to show the values of a variable over time (time series data)
  - Time is measured on the horizontal axis
  - The variable of interest is measured on the vertical axis
• An Example:

[Figure: Line chart, "Magazine Subscriptions by Year". Horizontal axis: years 1990 to 2006; vertical axis: thousands of subscribers, 0 to 350.]
• Cross-Sectional Data: Data collected by observing many subjects at the same point in time. They are usually collected for the purpose of comparison
• Time-series cross-sectional Data: Data collected by observing many subjects at successive points in time
The shape of the distribution
• The shape of the distribution is said to be symmetric if the observations are balanced, or evenly distributed, about the center

[Figure: Symmetric Distribution. Histogram of frequency against values 1 to 9.]

• The shape of the distribution is said to be skewed if the observations are not symmetrically distributed around the center

[Figure: Negatively Skewed Distribution. Histogram of frequency against values 1 to 9.]

[Figure: Positively Skewed Distribution. Histogram of frequency against values 1 to 9.]
Tables and Graphs to Describe Relationship Between Variables
• The graphs illustrated so far have involved only a single variable
• When there are two variables, other techniques are used:
  - Categorical (Qualitative) Variables: Cross tables (or contingency tables)
  - Numerical (Quantitative) Variables: Scatter plots
Cross Tables
• If there are r categories for the first variable (rows) and c categories for the second variable (columns), the table is called an r x c cross table
• Ex: 4 x 3 Cross Table for Investment Choices by Investor

Investment Category   Investor A   Investor B   Investor C   Total
Stocks                 46.5         55           27.5        129
Bonds                  32.0         44           19.0         95
CD                     15.5         20           13.5         49
Savings                16.0         28            7.0         51
Total                 110.0        147           67.0        324
Side by side bar chart

[Figure: Side-by-side bar chart, "Comparing Investors". Bars for Investor A, Investor B and Investor C within each investment category (Stocks, Bonds, CD, Savings); value axis 0 to 60.]
Scatter Diagrams
They are used for paired observations taken from two numerical variables. One variable is measured on the vertical axis and the other variable is measured on the horizontal axis

Volume per day   Cost per day
23               125
26               140
29               146
33               160
38               167
42               170
50               188
55               195
60               200

[Figure: Scatter diagram, "Cost per Day vs. Production Volume". Horizontal axis: volume per day, 0 to 70; vertical axis: cost per day, 0 to 250.]
Chapter Summary
• Data (veri) in raw form are usually not easy to use for decision making. Some type of organization in the form of tables or graphs is needed
• Terms reviewed in this chapter:
  - Variables (Değişkenler):
    - Categorical (Kategorik), Numerical (Sayısal)
    - Qualitative (Niteliksel), Quantitative (Niceliksel)
    - Independent (Bağımsız), Dependent (Bağımlı)
  - Ordinal scale (Sırasal Ölçek), Ratio scale (Oransal Ölçek)
  - Interval scale (Aralıksal Ölçek), Nominal scale (Sayısal Ölçek)
  - Line chart (Çizgisel grafik), Bar chart (Çubuk grafik)
  - Pie chart (Dairesel Grafik), Pareto diagram (Pareto Diyagramı)
  - Histogram (Histogram), Ogive (a cumulative line graph)
  - The Cumulative Frequency distribution (Kümülatif Frekans Dağılımı)
  - Time Series (Zaman Serisi)
  - Skewed (Çarpık Dağılım), Scatter plot (Saçılım Grafiği)
CHAPTER 3: USING NUMERICAL MEASURES TO DESCRIBE DATA
Measures of Central Tendency
• Mean: Arithmetic average of the values (the sum of the values divided by their number)
• Median: Midpoint of the ranked values
• Mode: Most frequently observed value in the data
  - Ex: Suppose the following bicycle prices: 2000, 100, 300, 100, 500
    - The mean is: (2000 + 100 + 300 + 100 + 500)/5 = 600
    - The median can be found after ranking: 2000, 500, 300, 100, 100; it is 300
    - The mode is 100
  - Even though the mean is the most commonly used measure of central tendency, it is sensitive to outliers; that is, it is strongly affected by very high or very low values in the data, even though these values may not be very informative
  - In such cases the median is often used, since the median is not sensitive to extreme values
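The three measures for the bicycle prices can be checked with Python's standard `statistics` module. The second half of this sketch drops the 2000 outlier, an extra step added here only to illustrate how much more the mean moves than the median.

```python
import statistics

# Bicycle prices from the example.
prices = [2000, 100, 300, 100, 500]

mean_price = statistics.mean(prices)      # sum of values / number of values
median_price = statistics.median(prices)  # middle value after ranking
mode_price = statistics.mode(prices)      # most frequent value

print(mean_price, median_price, mode_price)

# Illustration (not in the original example): remove the outlier of 2000.
prices_without_outlier = [p for p in prices if p != 2000]
mean_without = statistics.mean(prices_without_outlier)
median_without = statistics.median(prices_without_outlier)

print(mean_without, median_without)
```

Removing a single extreme price changes the mean from 600 to 250, while the median only moves from 300 to 200.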
• Note: the median is located at position $(n+1)/2$ in the ordered data
  - If the number of values is odd, the median is the middle number
  - If the number of values is even, the median is the average of the two middle numbers
• Formally, the mean (also called the arithmetic mean) is defined as follows
  - If calculated from a population of N values, the mean is denoted by $\mu$ and calculated as:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$

  - If calculated from a sample of n values, the mean is denoted by $\bar{x}$ and calculated as:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
Mean and Median Depending on Shape of a Distribution
Left-skewed: Mean < Median
Symmetric: Mean = Median
Right-skewed: Median < Mean
Measures of Variability
• Measures of variation give information on the spread or variability of the data values
  - Ex: Same center, different variation
• There are different measures of variability. The ones we are going to discuss are:
  - Range: Difference between the largest and the smallest observations
  - Interquartile Range: Eliminate high- and low-valued observations and calculate the range of the middle 50% of the data
  - Variance: Average of the squared deviations of the values from the mean
  - Standard Deviation: Square root of the variance
  - Coefficient of Variation: Standard deviation divided by the mean (shows relative variation)
• Range: Difference between the largest and the smallest observations
  - Ex: For data whose smallest value is 1 and largest value is 14, Range = 14 - 1 = 13
  - However, the range ignores the way in which the data are distributed and is sensitive to outliers. For example, two data sets that both run from 7 to 12 have the same range, 12 - 7 = 5, however differently their values are distributed in between
• Interquartile Range: Eliminate high- and low-valued observations and calculate the range of the middle 50% of the data
  - The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
  - Q2 is the same as the median (50% are smaller, 50% are larger)
  - Only 25% of the observations are greater than the third quartile, Q3
• Ex:

X minimum   Q1   Median (Q2)   Q3   X maximum
12          30   45            57   70

(25% of the observations fall in each of the four segments)

Interquartile range = Q3 - Q1 = 57 - 30 = 27
• Variance: Average of the squared deviations of the values from the mean
  - Population mean and variance

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}, \qquad \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

  - Sample mean and variance

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

• Standard Deviation: The square root of the variance. $\sigma$ is the population standard deviation, and $s$ is the sample standard deviation
• Ex: Sample Data ($x_i$): 10, 12, 14, 15, 17, 18, 18, 24
  - The sample size is n = 8. The mean can be found by

$$\bar{x} = \frac{10 + 12 + 14 + 15 + 17 + 18 + 18 + 24}{8} = 16$$

  - The standard deviation can be found by

$$s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}} = \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$$

$$s = \sqrt{\frac{130}{7}} = 4.3095 \quad \text{(a measure of the average scatter around the mean)}$$

• You don't have to rank the data to find the variance or the standard deviation
• Both measures are used for hypothesis testing for a single distribution, but they cannot be used to compare the variability of different distributions
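The sample variance and standard deviation of this example can be verified step by step in Python. This is a sketch that follows the formula with the n - 1 divisor; the variable names are chosen for illustration, and `statistics.stdev` from the standard library is used only as a cross-check.

```python
import math
import statistics

# Sample data from the example.
data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
x_bar = sum(data) / n

# Sum of squared deviations from the sample mean:
# 36 + 16 + 4 + 1 + 1 + 4 + 4 + 64 = 130
squared_deviations = sum((x - x_bar) ** 2 for x in data)

sample_variance = squared_deviations / (n - 1)  # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_variance)

print(x_bar, squared_deviations, round(sample_sd, 4))

# Cross-check against the standard library's sample standard deviation.
assert math.isclose(sample_sd, statistics.stdev(data))
```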
• Coefficient of Variation: Shows variation relative to the mean, so it measures relative variation and can be used to compare two or more sets of data measured in different units

$$CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$$

• Ex:
  - Stock A:
    - Average price last year = $50
    - Standard deviation = $5
    - CV = (5/50) · 100% = 10%
  - Stock B:
    - Average price last year = $100
    - Standard deviation = $5
    - CV = (5/100) · 100% = 5%
  - Both stocks have the same standard deviation, but stock B is less variable relative to its price
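The stock comparison can be written as a small Python function. The name `coefficient_of_variation` and the guard against a zero mean are assumptions added for this sketch.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the coefficient of variation as a percentage.

    Hypothetical helper for illustration; the CV is undefined for a zero mean.
    """
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(5, 50)    # Stock A: sd $5, average price $50
cv_b = coefficient_of_variation(5, 100)   # Stock B: sd $5, average price $100

print(f"Stock A: {cv_a:.0f}%  Stock B: {cv_b:.0f}%")
```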
More About Standard Deviation of a Distribution
• Chebyshev's Theorem: For any distribution (not necessarily normal) with mean $\mu$ and standard deviation $\sigma$, and any $k > 1$, the interval

$$\mu \pm k\sigma$$

(i.e. within k standard deviations of the mean) includes at least this much of the data:

$$100\left[1 - \frac{1}{k^2}\right]\%$$

  - Ex:

At least                     Within
(1 - 1/1.5^2) = 55.6%        k = 1.5  (μ ± 1.5σ)
(1 - 1/2^2)   = 75%          k = 2    (μ ± 2σ)
(1 - 1/3^2)   = 88.9%        k = 3    (μ ± 3σ)
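The bound is simple to evaluate for any k. A minimal Python sketch, where the function name `chebyshev_lower_bound` is an assumption made for this illustration:

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations
    of the mean, by Chebyshev's theorem (requires k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 100 * (1 - 1 / k ** 2)


for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1f}% within mu +/- {k} sigma")
```

Because the theorem holds for any distribution, these are guaranteed minimums; a particular data set will often have a larger share of its values inside the interval.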
• If the data distribution is bell-shaped (normally distributed), then the interval
  - $\mu \pm 1\sigma$ contains about 68% of the values in the population or the sample
  - $\mu \pm 2\sigma$ contains about 95% of the values in the population or the sample
  - $\mu \pm 3\sigma$ contains about 99.7% of the values in the population or the sample

[Figure: Bell-shaped curves centered at μ, with the intervals μ ± 1σ (68%), μ ± 2σ (95%) and μ ± 3σ (99.7%) marked.]
Weighted Mean and Measures of Grouped Data
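• The weighted mean of a set of data is

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{\sum w_i}$$

where $w_i$ is the weight of the ith observation
• It can be used when the data are already grouped into n classes, with $w_i$ values in the ith class
• Suppose a data set contains values $m_1, m_2, \ldots, m_K$, occurring with frequencies $f_1, f_2, \ldots, f_K$
  - Population mean and variance

$$\mu = \frac{\sum_{i=1}^{K} f_i m_i}{N} \quad \text{where } N = \sum_{i=1}^{K} f_i, \quad \text{and} \quad \sigma^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \mu)^2}{N}$$

  - Sample mean and variance

$$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n} \quad \text{where } n = \sum_{i=1}^{K} f_i, \quad \text{and} \quad s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1}$$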
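The grouped-data formulas can be sketched in Python as follows. The input values are an assumption made for illustration: the class midpoints (15, 25, 35, 45, 55) of the daily high temperature intervals from Chapter 2, with their frequencies. The function name `grouped_sample_stats` is likewise hypothetical.

```python
def grouped_sample_stats(midpoints, frequencies):
    """Return (sample mean, sample variance) for grouped data,
    using class values m_i with frequencies f_i."""
    n = sum(frequencies)
    mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
    variance = sum(f * (m - mean) ** 2
                   for f, m in zip(frequencies, midpoints)) / (n - 1)
    return mean, variance


# Assumed illustration: midpoints of the temperature classes and their frequencies.
midpoints = [15, 25, 35, 45, 55]
frequencies = [3, 6, 5, 4, 2]

grouped_mean, grouped_variance = grouped_sample_stats(midpoints, frequencies)
print(grouped_mean, round(grouped_variance, 2))
```

Since every value in a class is replaced by the class midpoint, the results approximate the mean and variance of the raw data rather than reproduce them exactly.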
Measures of Relationships Between Variables
• The covariance measures the strength of the linear relationship between two variables
  - Population covariance:

$$Cov(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$

  - Sample covariance:

$$Cov(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

  - Only concerned with the strength of the relationship
  - No causal effect is implied
• Interpreting Covariance:
  - $Cov(x, y) > 0 \Rightarrow$ x and y tend to move in the same direction
  - $Cov(x, y) < 0 \Rightarrow$ x and y tend to move in opposite directions
  - $Cov(x, y) = 0 \Rightarrow$ there is no linear relation between x and y
• The Coefficient of Correlation measures the relative strength of the linear relationship between two variables. It is relative because, unlike the covariance, this measure is not affected by the magnitude of the data
  - Population correlation coefficient: $\rho = \dfrac{Cov(x, y)}{\sigma_x \sigma_y}$
  - Sample correlation coefficient: $r = \dfrac{Cov(x, y)}{s_x s_y}$
• It is unit free and ranges between -1 and 1. The closer to -1, the stronger the negative linear relationship; the closer to 1, the stronger the positive linear relationship. 0 indicates no linear relationship between the variables of interest

[Figure: Scatter plots of Y against X illustrating different values of the correlation coefficient r.]
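The sample covariance and correlation formulas can be sketched in Python as follows. The function names and the small data set are hypothetical, chosen only so that the arithmetic is easy to follow by hand.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with n >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Sample correlation r = Cov(x, y) / (s_x * s_y)."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # Cov(x, x) is the variance of x
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical paired observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)   # 6 / 4 = 1.5
r_xy = sample_correlation(x, y)    # 1.5 / sqrt(2.5 * 1.5)
print(cov_xy, round(r_xy, 4))
```

Multiplying every x by 10 would multiply the covariance by 10 but leave r unchanged, which is the sense in which the correlation coefficient is unit free.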
Chapter Summary
• Terms reviewed in this chapter:
  - Mean (Ortalama), Median (Medyan, Ortanca Değer)
  - Mode (Mod, Tepe Değeri), Measure (Ölçü)
  - Range (Değişim Aralığı), Variance (Varyasyon)
  - Interquartile Range (Yarı-çeyreklik Değişim Aralığı), Coefficient of Variation (Varyasyon Katsayısı)
  - Standard Deviation (Standart Sapma), Weighted Mean (Ağırlıklı ortalama)
  - Covariance (Covaryasyon), Correlation (Corelasyon)