Lecture 3. Data Compression for Two Variables: Scatterplots, Cross-Tabulations, and Correlation
David R. Merrell
90-786 Intermediate Empirical Methods for Public Policy and Management
Lecture 3: Agenda
  Review of Lecture 2
  Cross-Tabulations
  Comparison Bar Charts
  Parallel Box Plots
  Scatterplots
  Correlation Coefficients
Review of Lecture 2
  Mean or Median
  Models for Data
Mean or Median
Complaints have reached the city manager that Tardy City is taking too long to pay its bills.
Data are days taken to pay seven bills: 34 27 64 31 30 26 35
Calculate the mean and median. What do you conclude?
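As a quick check on the exercise, here is a minimal Python sketch that computes both centers for the seven bills. The variable names are illustrative only. The single 64-day bill pulls the mean above the median, which is the point of comparing the two.

```python
from statistics import mean, median

# Days taken to pay seven bills (from the slide).
days = [34, 27, 64, 31, 30, 26, 35]

avg = mean(days)    # 247 / 7
mid = median(days)  # middle value of the sorted data

print(f"mean   = {avg:.2f} days")
print(f"median = {mid} days")
```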
Models for Data
Data = Fit + Residual
Fit as a Center
  Mean
  Median
  Mode
Example: Number of Stat Courses Taken by Students in 90-786
Bin     Frequency   Cumulative %
0        1            5.26%
1       15           84.21%
2        2           94.74%
3        1          100.00%
More     0          100.00%
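The frequency and cumulative-percent columns can be rebuilt from the raw responses. The sketch below assumes the 19 responses implied by the frequency column (one 0, fifteen 1s, two 2s, one 3). It is an illustration in Python, not the Excel procedure used for the slide.

```python
from collections import Counter

# Number of stat courses taken, rebuilt from the slide's frequency column:
# 1 student took 0 courses, 15 took 1, 2 took 2, 1 took 3.
courses = [0] * 1 + [1] * 15 + [2] * 2 + [3] * 1

counts = Counter(courses)
n = len(courses)

rows = []     # (bin, frequency, cumulative percent)
running = 0   # running count of observations at or below the current bin
for value in sorted(counts):
    running += counts[value]
    rows.append((value, counts[value], 100 * running / n))

for value, freq, cum_pct in rows:
    print(f"{value:>3} {freq:>10} {cum_pct:>10.2f}%")
```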
Histogram
[Figure: Excel histogram of the bins 0, 1, 2, 3, More, showing Frequency (bars) and Cumulative % (line).]
[Figure: Minitab histogram of C1, with Frequency on the vertical axis and the values 0 to 3 on the horizontal axis.]
Summary Statistics (Excel)
Mean                       1.157894737
Standard Error             0.138140489
Median                     1
Mode                       1
Standard Deviation         0.602140432
Sample Variance            0.362573099
Kurtosis                   4.885489992
Skewness                   1.659166502
Range                      3
Minimum                    0
Maximum                    3
Sum                        22
Count                      19
Confidence Level (95.0%)   0.290222623
Summary Statistics (Minitab)
Descriptive Statistics

Variable   N    Mean    Median   Tr Mean   StDev   SE Mean
C1         19   1.158   1.000    1.118     0.602   0.138

Variable   Min     Max     Q1      Q3
C1         0.000   3.000   1.000   1.000
Measures of Error

Fit      Sum Squared Residuals   Sum Absolute Residuals   Percent Misses
Mean     6.50                    7.05                     100.00
Median   7.00                    5.00                      21.05
Mode     7.00                    5.00                      21.05
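Sum of Squared Residuals
$$\sum_i (X_i - a)^2$$
Sum of Absolute Residuals
$$\sum_i |X_i - a|$$
Here $a$ is the fit (mean, median, or mode) and $X_i - a$ is the residual for observation $i$.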
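The three error measures can be checked against the course data with a short Python sketch. The 19 responses are rebuilt from the frequency table (an assumption about the raw data), and "percent misses" is read here as the share of observations that differ from the fit. By this arithmetic the sum of squared residuals about the mean comes out near 6.53 rather than the 6.50 shown on the slide, presumably a rounding difference.

```python
from statistics import mean, median, mode

# Responses rebuilt from the frequency table: one 0, fifteen 1s, two 2s, one 3.
courses = [0] * 1 + [1] * 15 + [2] * 2 + [3] * 1

def sum_squared_residuals(data, a):
    """Sum of (x - a)^2 over the data."""
    return sum((x - a) ** 2 for x in data)

def sum_absolute_residuals(data, a):
    """Sum of |x - a| over the data."""
    return sum(abs(x - a) for x in data)

def percent_misses(data, a):
    """Percent of observations not exactly equal to the fit a."""
    return 100 * sum(1 for x in data if x != a) / len(data)

fits = {"Mean": mean(courses), "Median": median(courses), "Mode": mode(courses)}

errors = {
    name: (
        sum_squared_residuals(courses, a),
        sum_absolute_residuals(courses, a),
        percent_misses(courses, a),
    )
    for name, a in fits.items()
}

for name, (ssr, sar, misses) in errors.items():
    print(f"{name:<7} {ssr:6.2f} {sar:6.2f} {misses:7.2f}")
```

The mean minimizes the squared-residual measure and the median minimizes the absolute-residual measure, which is why each fit wins on its own criterion.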
Data Compression for Two Variables...And More
  Two-Variable Description
  Cross-Tabulations
  Comparison Bar Charts
  Parallel Box Plots
  Scatterplots
  Scatterplot Matrix
  Correlation Coefficients
Two-Variable Description

                        Independent Variable
Dependent Variable      Nominal or Ordinal    Interval
Nominal or Ordinal      Cross-tabulation      Cross-tabulation (group interval data)
Interval                Table or chart        Scatterplot

(Row and column headings give each variable's level of measurement.)
Structure of a Cross-Tabulation

                      Independent Variable
Dependent Variable    Group 1      Group 2      Row Total
0                     a            b            a + b
1                     c            d            c + d
2                     e            f            e + f
Column Total          a + c + e    b + d + f    a + b + c + d + e + f
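To make the layout concrete, the sketch below tallies a cross-tabulation from individual observations. The eight records are hypothetical and exist only to fill the cells a through f; the row, column, and grand totals follow the structure shown above.

```python
from collections import Counter

# Hypothetical observations: (dependent-variable value, independent-variable group).
records = [
    (0, "Group 1"), (0, "Group 2"), (1, "Group 1"), (1, "Group 1"),
    (2, "Group 2"), (2, "Group 2"), (0, "Group 1"), (1, "Group 2"),
]

cells = Counter(records)  # cell counts; a missing cell counts as 0
row_values = sorted({dep for dep, _ in records})
groups = sorted({grp for _, grp in records})

row_totals = {dep: sum(cells[(dep, g)] for g in groups) for dep in row_values}
col_totals = {g: sum(cells[(dep, g)] for dep in row_values) for g in groups}
grand_total = sum(row_totals.values())

# Print the table with margins.
print(f"{'':<14}" + "".join(f"{g:>10}" for g in groups) + f"{'Row Total':>12}")
for dep in row_values:
    line = "".join(f"{cells[(dep, g)]:>10}" for g in groups)
    print(f"{dep:<14}" + line + f"{row_totals[dep]:>12}")
print(f"{'Column Total':<14}" + "".join(f"{col_totals[g]:>10}" for g in groups)
      + f"{grand_total:>12}")
```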
Street Repair Practices
Study street repair practices of local government
Cities and counties handle street repairs in one of three ways:
  using their own public employees exclusively
  contracting out part of the work
  contracting out all the work
Table 1. Street Repair: Counts

                               Type of Local Government
Street Repair Practice         City (No.)   County (No.)   Total
Only Public                       966          172         1,138
Public and Contracting out        396           61           457
Only Contracting out               36            8            44
Total                           1,398          241         1,639
Street Repair Practices by Type of Government: Public Employees and Contracting by Cities and Counties in the United States
Table 2. Street Repair: Percents

                               Type of Local Government
Street Repair Practice         City (%)   County (%)   Total (%)   Total (Number)
Only Public                     69.1%       71.4%        69.4%        1,138
Public and Contracting out      28.3%       25.3%        27.9%          457
Only Contracting out             2.6%        3.3%         2.7%           44
Total %                          100%        100%         100%
Number                          1,398         241        1,639
Street Repair Practices by Type of Government: Public Employees and Contracting by Cities and Counties in the United States
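The percentages are column percentages: each count divided by its column total, so that each type of government sums to 100%. A minimal Python sketch using the Table 1 counts follows; the dictionary layout is just one convenient way to hold the table. By this arithmetic the county share for "Only Public" is 172/241, about 71.4%.

```python
# Counts from Table 1, keyed by street repair practice and type of government.
counts = {
    "Only Public": {"City": 966, "County": 172},
    "Public and Contracting out": {"City": 396, "County": 61},
    "Only Contracting out": {"City": 36, "County": 8},
}
columns = ["City", "County"]

col_totals = {c: sum(row[c] for row in counts.values()) for c in columns}
grand_total = sum(col_totals.values())

percents = {}
for practice, row in counts.items():
    # Column percentages: cell count over its column total.
    percents[practice] = {c: 100 * row[c] / col_totals[c] for c in columns}
    # "Total" column: row total over the grand total.
    percents[practice]["Total"] = 100 * sum(row.values()) / grand_total

for practice, pct in percents.items():
    print(f"{practice:<28}"
          f"{pct['City']:6.1f}% {pct['County']:6.1f}% {pct['Total']:6.1f}%")
```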
Educational Achievement
Residents of Allegheny County that are in labor force
Random sample survey of Allegheny County residents in labor force in 199?
Variables: gender and highest educational achievement
Educational Achievement: Coding of Ordinal Variables
  1 if grade 4 or less
  2 if grades 5-7
  3 if grade 8
  4 if high school incomplete (9-11)
  5 if high school graduate (12)
  6 if technical, trade, or business after high school
  7 if college/university incomplete
  8 if college/university graduate or more
Educational Achievement Table

             Female             Male               Total
Education    No.    %           No.    %           No.    %
3              1     0.21%        1     0.21%        2     0.21%
4             25     5.27%       29     6.00%       54     5.64%
5            173    36.50%      137    28.36%      310    32.39%
6             49    10.34%       32     6.63%       81     8.46%
7             76    16.03%       88    18.22%      164    17.14%
8            150    31.65%      196    40.58%      346    36.15%
Total        474   100.00%      483   100.00%      957   100.00%
Bar Chart
[Figure: comparison bar chart of the percent of respondents (0% to 45%) at each education code (3 to 8), with separate bars for Female and Male.]
Job Satisfaction and Income for Postal Employees

                    Income
Job Satisfaction    Low        Medium     High
Low                 50%        20%        13.3%
Medium              30%        53.3%      20%
High                20%        26.7%      66.7%
Total               100%       100%       100%
                    (n=200)    (n=150)    (n=75)
Five Number Summary
                  Location
Age               Mon Valley   Pittsburgh   Other
Maximum           69           71           77
Upper quartile    45           43.5         47
Median            36           33           37
Lower quartile    27           26           29
Minimum           17           16           16
Age of Allegheny County residents by location: individuals in labor force in 199?.
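A five-number summary is enough to tell whether a box plot would show outliers, given a rule for flagging them. The sketch below assumes the common convention of fences at 1.5 times the interquartile range beyond the quartiles; whether the plot for these data used that rule is an assumption. The `fences` helper and its `k` parameter are illustrative names.

```python
# Five-number summaries from the slide: (min, lower quartile, median, upper quartile, max).
summaries = {
    "Mon Valley": (17, 27, 36, 45, 69),
    "Pittsburgh": (16, 26, 33, 43.5, 71),
    "Other": (16, 29, 37, 47, 77),
}

def fences(q1, q3, k=1.5):
    """Lower and upper outlier fences at k * IQR beyond the quartiles (assumed rule)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

flags = {}
for location, (minimum, q1, med, q3, maximum) in summaries.items():
    low, high = fences(q1, q3)
    # If the extreme value lies beyond a fence, at least one outlier exists on that side.
    flags[location] = {
        "low_outliers": minimum < low,
        "high_outliers": maximum > high,
    }
    print(f"{location:<11} fences = ({low}, {high})  {flags[location]}")
```

Under this rule the Pittsburgh and Other groups each have at least one high outlier, while the Mon Valley group has none.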
Parallel Box Plots
[Figure: parallel box plots of age (axis 10 to 80) for Mon Valley, Pittsburgh, and Other, with outlying points marked "o".]
Scatterplots
  Creating via Excel ChartWizard
  Transformation of Variables
  Scatterplot Matrices
Scatterplot 1
[Figure: Salary ($0 to $100,000) versus Years employed (0 to 30).]
Scatterplot 2
[Figure: Salary ($15,000 to $45,000) versus Years employed (0 to 30).]
Scatterplot 3
[Figure: Salary ($15,000 to $45,000) versus Years employed (0 to 30), with separate markers for Female employees and Male employees.]
Scatterplot Matrix
[Figure: scatterplot matrix of Years, Salary, and Age Hired.]
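Correlation Coefficient, r
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(X_i - \bar{X})(Y_i - \bar{Y})}{S_X S_Y}$$
where $S_X$ and $S_Y$ are the sample standard deviations of $X$ and $Y$.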
Properties of r
  -1 ≤ r ≤ +1
  r = -1: data all on negatively sloping straight line
  r = 0: data in "shot-gun" pattern
  r = +1: data all on positively sloping straight line
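These properties can be checked numerically. The sketch below implements the Pearson correlation in its usual sample form (cross-products divided by n - 1 and by the two sample standard deviations), which is assumed to match the slide's definition. The three small data sets are hypothetical: two exact straight lines and one symmetric pattern with no linear trend.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation using sample standard deviations (n - 1 divisor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]    # points on a positively sloping line
falling = [10 - 3 * x for x in xs]  # points on a negatively sloping line
no_trend = [1, -1, 0, -1, 1]        # symmetric pattern, no linear trend

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
r_no_trend = pearson_r(xs, no_trend)

print(round(r_rising, 6), round(r_falling, 6), round(r_no_trend, 6))
```

Note that r = 0 only rules out a linear trend; the third data set has a clear curved pattern even though its correlation is zero.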
International Adoption Visas: 1991 vs 1988
r:/academic/90-786/ Chatterjee/ Adopt.dat
International Adoption Visas

Country     1988   1991   1992
Africa        28     41     63
Belize         6      4      8
Bolivia       21     51     74
Brazil       164    178    139
Cambodia       0     59     16
Canada        12     12      6
Chile        252    263    176
China         52     62    201
Etc.
International Adoption Visas
[Figure: Excel scatterplot of log 1992 versus log 1988.]
[Figure: Minitab scatterplot of log 1992 versus log 1988.]
Excel Calculation of r
  Use statistical function, CORREL
  Eliminate missing data values
  Identify X data
  Identify Y data
  Finish
  Value: r = 0.879098 (.88)
Minitab Calculation of r
Correlations (Pearson)
Correlation of log 1988 and log 1992 = 0.873
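The same calculation can be sketched in Python. This example uses only the eight rows listed on the earlier slide, so it will not reproduce the 0.879 or 0.873 values, which come from the full data file. Two assumptions are made: a zero count is treated as unusable because its logarithm is undefined, and logs are taken to base 10 (the value of r does not depend on the base). The `pearson_r` helper is an illustrative implementation, not the Excel or Minitab routine.

```python
from math import log10, sqrt

# (country, 1988 visas, 1992 visas) for the rows shown on the slide.
visas = [
    ("Africa", 28, 63), ("Belize", 6, 8), ("Bolivia", 21, 74),
    ("Brazil", 164, 139), ("Cambodia", 0, 16), ("Canada", 12, 6),
    ("Chile", 252, 176), ("China", 52, 201),
]

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

# Assumption: drop any row with a zero count, since log(0) is undefined.
usable = [(v88, v92) for _, v88, v92 in visas if v88 > 0 and v92 > 0]

log_1988 = [log10(v88) for v88, _ in usable]
log_1992 = [log10(v92) for _, v92 in usable]

r = pearson_r(log_1988, log_1992)
print(f"rows used: {len(usable)}, r = {r:.3f}")
```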
Next Time ... Ethics and the Value of Data
  Social Value of Data
  Privacy Issues
  Confidentiality
  Applications in Health Care