Lecture Week 4 - Universiteit Leiden

Lecture Week 4 Inspecting Data: Distributions

Introduction to Research Methods & Statistics

2013 – 2014

Hemmo Smit

So next week

No lecture & workgroups

But…

Practice Test on-line (BB)

Enter data for your own research

Practice SPSS skills with own data

Overview

Descriptive research

Describing and presenting data

Frequency distributions

Graphical displays (1)

Measures of Central Tendency and Variability

Graphical displays (2): Boxplots

Read:

Leary: Chapter 6

Howell: Chapter 2

Types of descriptive research

Survey research: attitudes, lifestyles, behaviors, problems

Demographic research: patterns of basic life events (birth, marriage, migration, death)

Epidemiological research: occurrence of disease and death

3 types of surveys

Cross-sectional: one-shot “cross-section” of the population

Successive independent samples: changes over time, different respondents each time

! Are samples comparable?

Longitudinal (panel survey design): changes over time, same respondents more than once

! Drop out

Describing and presenting data

3 criteria for a good description:

1) Accurate

2) concise

3) comprehensible

Data can be presented in numerical and graphical format

Beware: Scale of measurement?!?

TIP: Always start with graphs

Trade-off

- Loss of information

- Possible distortion

How to describe a distribution?

A) Overall pattern

1) Shape

- number of peaks (uni-, bi- or multimodal)?

- symmetrical or skewed?

2) Central tendency / Location: midpoint

3) Spread: a little or a lot?

B) Deviations from the pattern

- Outliers: observations that lie far from the majority

- Tails: thick or thin?

Frequency distributions: Example

How do children recall stories?

Respondents: 25 children

Task: Tell researcher about a movie

Dependent variable: number of “and then…” statements

(see Howell, Exercise 2.1, p.55)

Raw data and frequency distributions

Table 1. Number of ‘and then’ statements (raw data)

18 17 16 18 15
15 18 16 20 18
22 20 17 21 17
19 17 21 20 19
18 12 23 20 20

Table 2. Number of ‘and then’ statements (frequency distribution)

Score   f    P
12      1    0.04
15      2    0.08
16      2    0.08
17      4    0.16
18      5    0.20
19      2    0.08
20      5    0.20
21      2    0.08
22      1    0.04
23      1    0.04
Total   25   1.00

Absolute and relative frequencies

Absolute frequency (f)

= Number of respondents with a given score

Disadvantage: hard to interpret / compare

Relative frequency (P)

= Proportion of the total with a given score (P = f / n)

Advantage: easy to interpret

Note:

0 ≤ P ≤ 1

P × 100 = percentage (%)
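
The frequency table for the story-recall example can be rebuilt from the raw data. The following is a minimal Python sketch, assuming the 25 scores are typed in by hand; the names `scores` and `frequency_table` are illustrative and do not come from the lecture.

```python
from collections import Counter

# Raw data: number of "and then..." statements for the 25 children
scores = [18, 17, 16, 18, 15,
          15, 18, 16, 20, 18,
          22, 20, 17, 21, 17,
          19, 17, 21, 20, 19,
          18, 12, 23, 20, 20]


def frequency_table(data):
    """Return rows (score, f, P) sorted by score, with P = f / n."""
    n = len(data)
    counts = Counter(data)
    return [(score, f, f / n) for score, f in sorted(counts.items())]


table = frequency_table(scores)
print("Score   f     P")
for score, f, p in table:
    print(f"{score:>5} {f:>3}  {p:.2f}")
```

Multiplying each P by 100 gives the percentage of children with that score.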

SPSS: Frequencies - Menu

Analyze > Descriptive Statistics > Frequencies

SPSS: Frequencies – Dialog box

SPSS: Frequencies - Output

Grouped frequency distribution (1)

Simple frequency distributions unclear in case of:

- small number of participants in each category and/or

- variables with many categories

Solution: grouped frequency table

Distribute the raw data over K class intervals and make a new frequency distribution

Make sure all intervals are:

- exhaustive and mutually exclusive

- of equal width

Grouped frequency distribution (2)

Score f P

12-14 1 0.04

15-17 8 0.32

18-20 12 0.48

21-23 4 0.16

total 25 1.00

Rule 1: number of classes (K) = √n

Rule 2: class interval width (I) = range / number of classes

(Range (R) = highest score – lowest score)

In our example

Number of intervals = √25 = 5

Range = 23 – 12 = 11

Interval width = 11 / 5 = 2.2, so use 2 or 3 (the grouped table above uses a width of 3, which gives 4 intervals)
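
The two rules of thumb can be applied step by step. The following Python sketch assumes a class width of 3 and a first class starting at the lowest score, which is the choice made in the grouped table; the variable names are illustrative only.

```python
import math

scores = [18, 17, 16, 18, 15, 15, 18, 16, 20, 18,
          22, 20, 17, 21, 17, 19, 17, 21, 20, 19,
          18, 12, 23, 20, 20]

n = len(scores)
k = round(math.sqrt(n))                    # Rule 1: number of classes
score_range = max(scores) - min(scores)    # Range = highest - lowest
raw_width = score_range / k                # Rule 2: range / number of classes

width = 3  # assumption: 2.2 rounded up to 3, as in the grouped table

# Build exhaustive, mutually exclusive classes of equal width
grouped = []
low = min(scores)
while low <= max(scores):
    high = low + width - 1
    f = sum(low <= s <= high for s in scores)
    grouped.append((f"{low}-{high}", f, f / n))
    low += width

for label, f, p in grouped:
    print(f"{label:>6} {f:>3}  {p:.2f}")
```

With a width of 3 the data fit into 4 classes rather than 5, which shows that the rules give a starting point and not a fixed answer.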

SPSS: Grouped frequency distribution (1)

SPSS: Grouped frequency distribution (2)

SPSS: Grouped frequency distribution (3)

SPSS: Grouped frequency distribution (4)

SPSS: Grouped frequency distribution (5)

Cumulative frequency distributions (1)

Real lower limit = lower limit – 0.5

Real upper limit = upper limit + 0.5

Midpoint = (upper limit + lower limit) / 2

Class interval   Real lower limit   Real upper limit   Midpoint   f    P      F
12-14            11.5               14.5               13
15-17            14.5               17.5               16
18-20            17.5               20.5               19
21-23            20.5               23.5               22
Total

Cumulative frequency distributions (2)

F = Cumulative Relative Frequency (CRF): the proportion for this class plus the proportions of all lower classes.

Class interval   Real lower limit   Real upper limit   Midpoint   f    P      F
12-14            11.5               14.5               13         1    0.04
15-17            14.5               17.5               16         8    0.32
18-20            17.5               20.5               19         12   0.48
21-23            20.5               23.5               22         4    0.16
Total                                                             25   1.00

Cumulative frequency distributions (3)

NB. Also possible: cumulative absolute frequency

Class interval   Real lower limit   Real upper limit   Midpoint   f    P      F
12-14            11.5               14.5               13         1    0.04   0.04
15-17            14.5               17.5               16         8    0.32   0.36
18-20            17.5               20.5               19         12   0.48   0.84
21-23            20.5               23.5               22         4    0.16   1.00
Total                                                             25   1.00
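
The columns of the cumulative table follow directly from the class limits and the absolute frequencies. The following Python sketch takes the four classes as given; the dictionary keys are illustrative names, not SPSS output.

```python
# (lower limit, upper limit, f) for each class of the grouped table
classes = [(12, 14, 1), (15, 17, 8), (18, 20, 12), (21, 23, 4)]
n = sum(f for _, _, f in classes)

rows = []
cumulative_f = 0
for lower, upper, f in classes:
    cumulative_f += f                       # cumulative absolute frequency
    rows.append({
        "interval": f"{lower}-{upper}",
        "real_lower": lower - 0.5,          # real lower limit
        "real_upper": upper + 0.5,          # real upper limit
        "midpoint": (lower + upper) / 2,
        "f": f,
        "P": f / n,                         # relative frequency
        "F": cumulative_f / n,              # cumulative relative frequency
    })

for r in rows:
    print(f'{r["interval"]:>6} {r["real_lower"]:>5} {r["real_upper"]:>5} '
          f'{r["midpoint"]:>5} {r["f"]:>3} {r["P"]:.2f} {r["F"]:.2f}')
```

The last value of F is always 1.00, because every score falls at or below the highest real upper limit.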

Cumulative frequency distributions (4)

The cumulative relative frequency polygon graphs the proportion of respondents with a score of X or lower.

Graphical displays: Nominal / Ordinal

[Figure: bar charts and pie charts of the variable "score", for raw data (categories 2 to 9) and for grouped data (categories 2-3, 4-5, 6-7, 8-9). Bar chart axes: score (x), Count (y).]

Graphical displays: Interval

Histograms

[Figure: histogram, not reproduced]

Stem & Leaf Display

 Freq.   Stem & Leaf
 1,00    Extremes (=<12,0)
 2,00    15. 00
 2,00    16. 00
 4,00    17. 0000
 5,00    18. 00000
 2,00    19. 00
 5,00    20. 00000
 2,00    21. 00
 1,00    22. 0
 1,00    23. 0

 Stem width: 1
 Each leaf: 1 case(s)

Histogram – symmetrical or skewed?

[Figure: three example histograms: negatively skewed, positively skewed, symmetrical]

SPSS: Graphs – Chart Builder / Legacy Dialogs

SPSS: Graphs > Legacy Dialogs

SPSS - Graphs > Chart builder

Measures of central tendency

1. Mode (Mo) = most common score

2. Median (Mdn) = middle score (50th percentile)

3. Mean (M) = average

Median location: $\frac{N + 1}{2}$

Mean: $\bar{x} = \frac{\sum x_i}{n}$ or $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$
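
The three measures can be computed for the story-recall scores with Python's standard `statistics` module. This is a sketch for checking hand calculations, not the SPSS procedure used in the course.

```python
import statistics

scores = [18, 17, 16, 18, 15, 15, 18, 16, 20, 18,
          22, 20, 17, 21, 17, 19, 17, 21, 20, 19,
          18, 12, 23, 20, 20]

modes = statistics.multimode(scores)     # all scores that share the highest frequency
median = statistics.median(scores)       # middle score of the sorted data
mean = statistics.mean(scores)           # sum of the scores divided by n

median_location = (len(scores) + 1) / 2  # position of the median in the sorted data

print("Mode(s):", modes)
print("Median:", median, "at position", median_location)
print("Mean:", mean)
```

Scores 18 and 20 both occur five times, so this distribution has two modes; the median is the 13th of the 25 sorted scores.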

Central tendency and skewness

Shape    positive skew   symmetrical   negative skew
Mode     A               A             C
Median   B               A             B
Mean     C               A             A

Measures of variability

1. Range (R) = Highest score – Lowest score

2. Interquartile range (IQR) = Q3 – Q1

3. Standard deviation (s or σ) = square root of the variance, in the same units as the scores

4. Variance (s² or σ²) = average squared deviation from the mean

Variance and standard deviation

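Score    Deviation            Squared
$x_1$    $(x_1 - \bar{x})$    $(x_1 - \bar{x})^2$
$x_2$    $(x_2 - \bar{x})$    $(x_2 - \bar{x})^2$
$x_3$    $(x_3 - \bar{x})$    $(x_3 - \bar{x})^2$
…        …                    …
$x_n$    $(x_n - \bar{x})$    $(x_n - \bar{x})^2$
Sum: $\sum x_i$    0    ≥ 0

Standard deviation: $s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

Variance: $s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$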

The standard deviation and variance are:

only suitable as measures of spread around the mean

Not robust against outliers
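
The deviation table and the two formulas can be traced in a few lines. The following Python sketch uses the story-recall scores and divides by n - 1, as in the sample formulas; the variable names are illustrative.

```python
import math

scores = [18, 17, 16, 18, 15, 15, 18, 16, 20, 18,
          22, 20, 17, 21, 17, 19, 17, 21, 20, 19,
          18, 12, 23, 20, 20]

n = len(scores)
mean = sum(scores) / n

deviations = [x - mean for x in scores]    # column "Deviation"
squared = [d ** 2 for d in deviations]     # column "Squared"

variance = sum(squared) / (n - 1)          # sample variance
sd = math.sqrt(variance)                   # sample standard deviation

print(f"Sum of deviations: {sum(deviations):.10f}")
print(f"Variance: {variance:.2f}")
print(f"Standard deviation: {sd:.2f}")
```

The deviations sum to zero (up to rounding error), which is why they are squared before averaging.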

Five-number summary and boxplot

Five-number summary consists of:

Minimum = Lowest (non-outlying) score

Q1 = 25th percentile (25% lower, 75% higher)

Median (=Q2) = 50th percentile

Q3 = 75th percentile

Maximum = Highest (non-outlying) score

Graphical display: Boxplot

Boxplot - Example

Numerical (five-number summary) Graphical (boxplot)

Data: 3 13 17 19 22 24 25 28 35 39 44 45 83 86 93

Min = 3
Q1 = 19
Mdn = 28
Q3 = 45
Max = 93

IQR = 45 – 19 = 26
Q1 – 1.5 × IQR = 19 – 39 = -20
Q3 + 1.5 × IQR = 45 + 39 = 84

The scores 86 and 93 lie above 84, so both count as outliers.

Rule of thumb

Outlier = observation that lies more than 1.5 × IQR above Q3 or below Q1.
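
The boxplot example can be checked numerically. Quartile conventions differ between textbooks and software, so the following Python sketch assumes the "median of each half" method (halves exclude the middle score when n is odd), which reproduces the quartiles in this example; the helper names are hypothetical, and at least two scores are assumed.

```python
data = [3, 13, 17, 19, 22, 24, 25, 28, 35, 39, 44, 45, 83, 86, 93]


def median(values):
    """Middle score, or the mean of the two middle scores for even n."""
    values = sorted(values)
    mid = len(values) // 2
    if len(values) % 2 == 1:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    values = sorted(values)
    half = len(values) // 2
    lower_half = values[:half]       # scores below the median position
    upper_half = values[-half:]      # scores above the median position
    return (values[0], median(lower_half), median(values),
            median(upper_half), values[-1])


summary = five_number_summary(data)
minimum, q1, mdn, q3, maximum = summary

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr           # scores below this are outliers
high_fence = q3 + 1.5 * iqr          # scores above this are outliers
outliers = [x for x in data if x < low_fence or x > high_fence]

print("Five-number summary:", summary)
print("IQR:", iqr, "fences:", low_fence, high_fence)
print("Outliers:", outliers)
```

Other quartile methods can give slightly different values for Q1 and Q3 on the same data, so software output may not match a hand calculation exactly.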

Overview

Scale of measurement    Graphical                            CT       Spread
Nominal                 Bar chart, (Pie chart)               Mode     ---
Ordinal                 Boxplot                              Median   Range, IQR
Interval (and higher)   Histogram, (Stem & Leaf display)     Mean     Standard dev., Variance

What have you learned today?

What are the various ways to represent distributions numerically?

What are the various ways to represent distributions graphically?

How to describe a distribution

How to create and evaluate various numerical and graphical representations of distributions

How to determine which numerical and graphical representation is suitable for a variable

Next week

No lecture and workgroups

Practice test on Blackboard

Enter your own data

Read:

Howell: Chapter 3

In two weeks

Normal distribution and standard scores