Lecture 3. Data Compression for Two Variables: Scatterplots, Cross-Tabulations, and Correlation David R. Merrell 90-786 Intermediate Empirical Methods for Public Policy and Management


Page 1

Lecture 3. Data Compression for Two Variables: Scatterplots, Cross-Tabulations, and Correlation

David R. Merrell
90-786 Intermediate Empirical Methods for Public Policy and Management

Page 2

Lecture 3: Agenda

  • Review of Lecture 2
  • Cross-Tabulations
  • Comparison Bar Charts
  • Parallel Box Plots
  • Scatterplots
  • Correlation Coefficients

Page 3

Review of Lecture 2

  • Mean or Median
  • Models for Data

Page 4

Mean or Median

Complaints have reached the city manager that Tardy City is taking too long to pay its bills.

Data are days taken to pay seven bills: 34 27 64 31 30 26 35

Calculate the mean and median. What do you conclude?
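A minimal sketch of the calculation the slide asks for, using only the seven values listed above and Python's standard `statistics` module:

```python
from statistics import mean, median

# Days taken to pay the seven bills, as listed on the slide
days = [34, 27, 64, 31, 30, 26, 35]

print(f"mean   = {mean(days):.1f} days")   # 247 / 7
print(f"median = {median(days)} days")     # middle value of the sorted data
```

The single 64-day bill pulls the mean (about 35.3 days) above the median (31 days), so the median is arguably the better description of a typical bill here.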

Page 5

Models for Data

  • Data = Fit + Residual
  • Fit as a Center
      • Mean
      • Median
      • Mode
  • Example: Number of Stat Courses Taken by Students in 90-786

Page 6

Bin     Frequency   Cumulative %
0               1          5.26%
1              15         84.21%
2               2         94.74%
3               1        100.00%
More            0        100.00%
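The frequency and cumulative-percent columns can be rebuilt from the raw observations. The sketch below assumes the raw data are exactly the 19 values implied by the frequencies above (one 0, fifteen 1s, two 2s, one 3); the variable names are illustrative.

```python
from collections import Counter

# Raw data rebuilt from the slide's frequencies: 1 zero, 15 ones, 2 twos, 1 three
courses = [0] * 1 + [1] * 15 + [2] * 2 + [3] * 1

counts = Counter(courses)
n = len(courses)

rows = []      # (bin, frequency, cumulative percent)
running = 0
for value in sorted(counts):
    running += counts[value]
    rows.append((value, counts[value], 100 * running / n))

for value, freq, cum_pct in rows:
    print(f"{value:>3} {freq:>9} {cum_pct:>11.2f}%")
```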

[Figure: Histogram. x-axis: Bin (0, 1, 2, 3, More); y-axes: Frequency (0 to 20) and Cumulative % (0% to 150%)]

Page 7

[Figure: Histogram of C1. x-axis: C1 (0 to 3); y-axis: Frequency (0 to 15)]

Page 8

Summary Statistics (Excel)

Mean                        1.157894737
Standard Error              0.138140489
Median                      1
Mode                        1
Standard Deviation          0.602140432
Sample Variance             0.362573099
Kurtosis                    4.885489992
Skewness                    1.659166502
Range                       3
Minimum                     0
Maximum                     3
Sum                         22
Count                       19
Confidence Level (95.0%)    0.290222623

Page 9

Summary Statistics (Minitab)

Descriptive Statistics

Variable     N    Mean  Median  Tr Mean   StDev  SE Mean
C1          19   1.158   1.000    1.118   0.602    0.138

Variable     Min     Max      Q1      Q3
C1         0.000   3.000   1.000   1.000

Page 10

Measures of Error

          Sum Squared Residuals   Sum Absolute Residuals   Percent Misses
Mean                       6.50                     7.05           100.00
Median                     7.00                     5.00            21.05
Mode                       7.00                     5.00            21.05

Sum of Squared Residuals: $\sum_i (X_i - a)^2$

Sum of Absolute Residuals: $\sum_i |X_i - a|$
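The three error measures can be recomputed for each candidate fit a (mean, median, or mode). This is a sketch under two assumptions: the data are the 19 stat-course counts implied by the frequency table (one 0, fifteen 1s, two 2s, one 3), and "percent misses" means the share of observations not exactly equal to the fit. The function name `error_measures` is illustrative.

```python
from statistics import mean, median, mode

# Data rebuilt from the frequency table: 1 zero, 15 ones, 2 twos, 1 three
courses = [0] * 1 + [1] * 15 + [2] * 2 + [3] * 1

def error_measures(data, a):
    """Return (sum of squared residuals, sum of absolute residuals, percent misses) for fit a."""
    ssr = sum((x - a) ** 2 for x in data)
    sar = sum(abs(x - a) for x in data)
    # Assumed definition: a "miss" is any observation that differs from the fit
    misses = 100 * sum(x != a for x in data) / len(data)
    return ssr, sar, misses

for name, fit in [("Mean", mean(courses)), ("Median", median(courses)), ("Mode", mode(courses))]:
    ssr, sar, misses = error_measures(courses, fit)
    print(f"{name:<7} {ssr:6.2f} {sar:6.2f} {misses:7.2f}")
```

Under these assumptions the median and mode rows match the table. For the mean, direct computation gives a sum of squared residuals of about 6.53 rather than the listed 6.50; the 6.53 figure agrees with the sample variance on the Excel summary slide (0.3626 × 18). The mean minimizes squared residuals, while the median minimizes absolute residuals.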

Page 11

Data Compression for Two Variables ... And More

  • Two-Variable Description
  • Cross-Tabulations
  • Comparison Bar Charts
  • Parallel Box Plots
  • Scatterplots
  • Scatterplot Matrix
  • Correlation Coefficients

Page 12

Two-Variable Description

Dependent Variable       Independent Variable (Level of Measurement)
(Level of Measurement)   Nominal or Ordinal    Interval

Nominal or Ordinal       Cross-tabulation      Cross-tabulation (group interval data)
Interval                 Table or chart        Scatterplot

Page 13

Structure of a Cross-Tabulation

Dependent      Independent Variable
Variable       Group 1        Group 2        Row Total

0              a              b              a + b
1              c              d              c + d
2              e              f              e + f

Column Total   a + c + e      b + d + f      a + b + c + d + e + f

Page 14

Street Repair Practices

Study street repair practices of local government

Cities and counties handle street repairs by:

  • using their own public employees exclusively
  • contracting out part of the work
  • contracting out all the work

Page 15

Table 1. Street Repair: Counts

Street Repair Practices by Type of Government: Public Employees and Contracting by Cities and Counties in the United States

                              Type of Local Government
Street Repair Practice        City (No.)   County (No.)    Total

Only Public                          966            172    1,138
Public and Contracting out           396             61      457
Only Contracting out                  36              8       44

Total                              1,398            241    1,639

Page 16

Table 2. Street Repair: Percents

Street Repair Practices by Type of Government: Public Employees and Contracting by Cities and Counties in the United States

                              Type of Local Government
Street Repair Practice        City (%)   County (%)   Total (%)   Total (Number)

Only Public                      69.1%        71.4%       69.4%            1,138
Public and Contracting out       28.3%        25.3%       27.9%              457
Only Contracting out              2.6%         3.3%        2.7%               44

Total %                           100%         100%        100%
Number                           1,398          241       1,639
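The percentages in Table 2 are column percentages: each count from Table 1 is divided by its column total. A minimal sketch of that arithmetic, with the counts copied from Table 1 and illustrative variable names:

```python
# Counts from Table 1: rows are street repair practices, columns are government types
counts = {
    "Only Public":                {"City": 966, "County": 172},
    "Public and Contracting out": {"City": 396, "County": 61},
    "Only Contracting out":       {"City": 36,  "County": 8},
}

columns = ["City", "County"]
col_totals = {c: sum(row[c] for row in counts.values()) for c in columns}
grand_total = sum(col_totals.values())

percents = {}  # practice -> {column: percent of that column's total}
for practice, row in counts.items():
    row_total = sum(row.values())
    percents[practice] = {c: 100 * row[c] / col_totals[c] for c in columns}
    percents[practice]["Total"] = 100 * row_total / grand_total

for practice, pct in percents.items():
    print(f"{practice:<27} " + " ".join(f"{pct[c]:6.1f}%" for c in columns + ["Total"]))
```

Percentaging within each category of the independent variable (type of government) lets cities and counties be compared directly even though there are far more cities in the sample.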

Page 17

Educational Achievement

Residents of Allegheny County that are in labor force

Random sample survey of Allegheny County residents in labor force in 199?

Variables: gender and highest educational achievement

Page 18

Educational Achievement: Coding of Ordinal Variables

  1 if grade 4 or less
  2 if grades 5-7
  3 if grade 8
  4 if high school incomplete (9-11)
  5 if high school graduate (12)
  6 if technical, trade, or business after high school
  7 if college/university incomplete
  8 if college/university graduate or more

Page 19

Educational Achievement Table

Education        Female              Male               Total
               No.        %       No.        %       No.        %

3                1    0.21%         1    0.21%         2    0.21%
4               25    5.27%        29    6.00%        54    5.64%
5              173   36.50%       137   28.36%       310   32.39%
6               49   10.34%        32    6.63%        81    8.46%
7               76   16.03%        88   18.22%       164   17.14%
8              150   31.65%       196   40.58%       346   36.15%

Total          474  100.00%       483  100.00%       957  100.00%

Page 20

Bar Chart

[Figure: Comparison bar chart. x-axis: education codes 3 to 8; y-axis: 0% to 45%; series: Female, Male]

Page 21

Job Satisfaction and Income for Postal Employees

Job                        Income
Satisfaction   Low        Medium     High

Low            50%        20%        13.3%
Medium         30%        53.3%      20%
High           20%        26.7%      66.7%
Total          100%       100%       100%
               (n=200)    (n=150)    (n=75)
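Because the three income columns have different sizes, the overall distribution of job satisfaction is not the plain average of the column percentages. The sketch below assumes the table shows column percentages rounded to one decimal, recovers approximate cell counts from them, and then pools the columns; the variable names are illustrative.

```python
# Column percentages and column sizes from the slide (income level -> satisfaction -> %)
col_pct = {
    "Low":    {"Low": 50.0, "Medium": 30.0, "High": 20.0},
    "Medium": {"Low": 20.0, "Medium": 53.3, "High": 26.7},
    "High":   {"Low": 13.3, "Medium": 20.0, "High": 66.7},
}
n = {"Low": 200, "Medium": 150, "High": 75}

# Approximate cell counts: percent of the column size, rounded to whole employees
cell_counts = {
    income: {sat: round(p * n[income] / 100) for sat, p in pcts.items()}
    for income, pcts in col_pct.items()
}

total = sum(n.values())
overall = {
    sat: 100 * sum(cell_counts[income][sat] for income in n) / total
    for sat in ["Low", "Medium", "High"]
}
for sat, pct in overall.items():
    print(f"{sat:<7} {pct:5.1f}%")
```

Pooling weights each column by its n, so the large low-income group influences the overall percentages more than the small high-income group.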

Page 22

Five Number Summary

                          Location
Age               Mon Valley   Pittsburgh   Other

Maximum                   69           71      77
Upper quartile            45         43.5      47
Median                    36           33      37
Lower quartile            27           26      29
Minimum                   17           16      16

Age of Allegheny County residents by location: individuals in labor force in 199?.
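A five-number summary is what a box plot draws. One common convention, assumed here and not stated on the slide, flags points more than 1.5 interquartile ranges beyond the quartiles as outliers. The sketch applies that rule to the three summaries above; the function name `fences` is illustrative.

```python
# Five-number summaries from the slide: (min, lower quartile, median, upper quartile, max)
summaries = {
    "Mon Valley": (17, 27, 36, 45, 69),
    "Pittsburgh": (16, 26, 33, 43.5, 71),
    "Other":      (16, 29, 37, 47, 77),
}

def fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences; k = 1.5 is an assumed convention."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

flags = {}  # location -> (lower fence, upper fence, is the maximum beyond the upper fence?)
for location, (low, q1, med, q3, high) in summaries.items():
    low_fence, high_fence = fences(q1, q3)
    beyond = high > high_fence
    flags[location] = (low_fence, high_fence, beyond)
    print(f"{location:<11} IQR={q3 - q1:5.1f} "
          f"fences=({low_fence:.2f}, {high_fence:.2f}) max beyond upper fence: {beyond}")
```

Under the assumed 1.5 × IQR rule, the maximum ages for Pittsburgh (71) and Other (77) fall beyond their upper fences and would be plotted as individual outlier points, while the Mon Valley maximum (69) would not.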

Page 23

Parallel Box Plots

[Figure: Parallel box plots by location (The Mon Valley, Pittsburgh, Other). y-axis: 10 to 80; individual points marked "o"]

Page 24

Scatterplots

  • Creating via Excel ChartWizard
  • Transformation of Variables
  • Scatterplot Matrices

Page 25

Scatterplot 1

[Figure: Scatterplot. x-axis: Years employed (0 to 30); y-axis: Salary ($0 to $100,000)]

Page 26

Scatterplot 2

[Figure: Scatterplot. x-axis: Years employed (0 to 30); y-axis: Salary ($15,000 to $45,000)]

Page 27

Scatterplot 3

[Figure: Scatterplot. x-axis: Years employed (0 to 30); y-axis: Salary ($15,000 to $45,000); series: Female employees, Male employees]

Page 28

Scatterplot Matrix

[Figure: Scatterplot matrix of Years, Salary, and Age Hired]

Page 29

Correlation Coefficient, r

$$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1)\, S_X S_Y}$$

where $S_X$ and $S_Y$ are the sample standard deviations of X and Y.

Page 30

Properties of r

$-1 \le r \le +1$

r = -1 data all on negatively sloping straight line

r = 0 data in "shotgun" pattern

r = +1 data all on positively sloping straight line
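The two limiting cases can be checked numerically. The sketch below computes r as the sum of cross-products of deviations divided by (n − 1) times the two standard deviations, assuming sample standard deviations with an n − 1 divisor. The data are hypothetical, chosen only so that every point lies on a straight line.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation: sum of cross-products of deviations over (n - 1) * S_X * S_Y."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * stdev(xs) * stdev(ys))

# Hypothetical data chosen only to illustrate the limiting cases
x = [1, 2, 3, 4, 5]
rising = [2 * v + 1 for v in x]     # all points on a positively sloping line
falling = [10 - 3 * v for v in x]   # all points on a negatively sloping line

print(pearson_r(x, rising))    # approximately +1, up to floating-point rounding
print(pearson_r(x, falling))   # approximately -1, up to floating-point rounding
```

The slope's size does not matter, only its sign: any exact straight-line relationship gives r of +1 or -1.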

Page 31

International Adoption Visas: 1991 vs 1988

r:/academic/90-786/Chatterjee/Adopt.dat

Page 32

International Adoption Visas

Country     1988   1991   1992

Africa        28     41     63
Belize         6      4      8
Bolivia       21     51     74
Brazil       164    178    139
Cambodia       0     59     16
Canada        12     12      6
Chile        252    263    176
China         52     62    201
Etc.

Page 33

International Adoption Visas

[Figure: Scatterplot. x-axis: log 1988 (0 to 4); y-axis: log 1992 (0 to 3.5)]

Page 34

[Figure: Scatterplot. x-axis: log 1988 (0 to 9); y-axis: log 1992 (2 to 8)]

Page 35

Excel Calculation of r

  • Use statistical function, correl
  • Eliminate missing data values
  • Identify X data
  • Identify Y data
  • Finish
  • Value: r = 0.879098 (.88)
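The same workflow can be sketched outside Excel. The example below uses only the eight rows shown on the adoption visa slide, so it will not reproduce the r reported for the full data file. It assumes a zero count is treated as a missing value because its logarithm is undefined, and it assumes base-10 logs; the choice of base does not change r.

```python
import math
from statistics import mean, stdev

# Only the eight rows shown on the slide: country -> (1988 visas, 1992 visas)
visas = {
    "Africa":   (28, 63),
    "Belize":   (6, 8),
    "Bolivia":  (21, 74),
    "Brazil":   (164, 139),
    "Cambodia": (0, 16),
    "Canada":   (12, 6),
    "Chile":    (252, 176),
    "China":    (52, 201),
}

def pearson_r(xs, ys):
    """Sample correlation coefficient using n - 1 standard deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * stdev(xs) * stdev(ys))

# Eliminate missing values: here, rows where a zero count makes the log undefined (assumption)
usable = {c: (a, b) for c, (a, b) in visas.items() if a > 0 and b > 0}

# Identify the X data (log 1988) and the Y data (log 1992)
log_1988 = [math.log10(a) for a, _ in usable.values()]
log_1992 = [math.log10(b) for _, b in usable.values()]

r = pearson_r(log_1988, log_1992)
print(f"rows kept: {len(usable)} of {len(visas)}, r = {r:.3f}")
```

Rows must be dropped as pairs: removing a value from only one column would misalign the X and Y data.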

Page 36

Minitab Calculation of r

Correlations (Pearson)

Correlation of log 1988 and log 1992 = 0.873

Page 37

Next Time ... Ethics and the Value of Data

  • Social Value of Data
  • Privacy Issues
  • Confidentiality
  • Applications in Health Care