Lecture Week 4 - Universiteit Leiden...Overview Descriptive research Describing and presenting data...

Lecture Week 4 Inspecting Data: Distributions

Introduction to Research Methods & Statistics

2013 – 2014

Hemmo Smit

So next week

No lecture & workgroups

But…

Practice Test on-line (BB)

Enter data for your own research

Practice SPSS skills with own data

Overview

Descriptive research

Describing and presenting data

Frequency distributions

Graphical displays (1)

Measures of Central Tendency and Variability

Graphical displays (2): Boxplots

Leary: Chapter 6

Howell: Chapter 2

Types of descriptive research

- Survey research: attitudes, lifestyles, behaviors, problems

- Demographic research: patterns of basic life events (birth, marriage, migration, death)

- Epidemiological research: occurrence of disease and death

3 types of surveys

- Cross-sectional: one-shot “cross-section” of the population

- Successive independent samples: changes over time; different respondents each time

  ! Are samples comparable?

- Longitudinal (panel survey design): changes over time; same respondents more than once

  ! Drop out

Describing and presenting data

3 criteria for a good description:

1) Accurate

2) Concise

3) Comprehensible

Data can be presented in numerical and graphical format

Beware: which format is suitable depends on the scale of measurement

TIP: Always start with graphs

Trade-off

- Loss of information

- Possible distortion

How to describe a distribution?

A) Overall pattern

1) Shape

- number of peaks (uni-, bi- or multi-modal)?

- symmetrical or skewed?

2) Central tendency / Location: midpoint

3) Spread: a little or a lot?

B) Deviations from the pattern

- Outliers: observations that lie far from the majority

- Tails: thick or thin?

Frequency distributions: Example

How do children recall stories?

Respondents: 25 children

Task: Tell researcher about a movie

Dependent variable: number of “and then…” statements

(see Howell, Exercise 2.1, p.55)

Raw data and frequency distributions

Table 1. # ‘and then’ statements (raw data)

18 17 16 18 15
15 18 16 20 18
22 20 17 21 17
19 17 21 20 19
18 12 23 20 20

Table 2. # ‘and then’ statements (frequency distribution)

Score   f    P
12      1    0.04
15      2    0.08
16      2    0.08
17      4    0.16
18      5    0.20
19      2    0.08
20      5    0.20
21      2    0.08
22      1    0.04
23      1    0.04
Total   25   1.00

Absolute and relative frequencies

Absolute frequency (f)

= Number of respondents with a given score

Disadvantage: hard to interpret / compare

Relative frequency (P)

= Proportion of the total with a given score (P = f / n)

Advantage: easy to interpret

0 ≤ P ≤ 1

P × 100 = percentage (%)
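
As a minimal sketch of the definitions above, the following Python snippet tabulates f and P for the 25 scores of the story-recall example. The lecture itself uses SPSS for this step; the variable names here are illustrative assumptions, not part of the course material.

```python
from collections import Counter

# Raw scores from the story-recall example (25 children).
scores = [18, 17, 16, 18, 15,
          15, 18, 16, 20, 18,
          22, 20, 17, 21, 17,
          19, 17, 21, 20, 19,
          18, 12, 23, 20, 20]

n = len(scores)
absolute = Counter(scores)                                  # f: count per score
relative = {s: absolute[s] / n for s in sorted(absolute)}   # P = f / n

print("Score  f     P")
for s, p in relative.items():
    print(f"{s:>5}  {absolute[s]:<4}  {p:.2f}")
print(f"Total  {n:<4}  {sum(relative.values()):.2f}")
```

The proportions add up to 1, which is a quick check that no score was counted twice or missed.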

SPSS: Frequencies - Menu

Analyze > Descriptive Statistics > Frequencies

SPSS: Frequencies – Dialog box

SPSS: Frequencies - Output

Grouped frequency distribution (1)

Simple frequency distributions become unclear in case of:

- small number of participants in each category and/or

- variables with many categories

Solution: grouped frequency table

Distribute the raw data over K class intervals and make a new frequency distribution

Make sure all intervals are:

- exhaustive and mutually exclusive

- of equal width

Grouped frequency distribution (2)

Score   f    P
12-14   1    0.04
15-17   8    0.32
18-20   12   0.48
21-23   4    0.16
Total   25   1.00

Rule 1: number of classes (K) = √n

Rule 2: class interval width (I) = range / number of classes

(Range (R) = highest score – lowest score)

In our example

Number of intervals = √25 = 5

Range = 23 – 12 = 11

Interval width = 11 / 5 = 2.2, rounded to 2 or 3 (the grouped table uses width 3, which covers the range in 4 intervals)
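
The two rules of thumb can be sketched in Python as follows. This is an illustration under stated assumptions: the width is rounded up to 3 and the first interval starts at the lowest score, which matches the grouped table in this lecture but is not the only defensible choice.

```python
import math

# Raw scores from the story-recall example (25 children).
scores = [18, 17, 16, 18, 15,
          15, 18, 16, 20, 18,
          22, 20, 17, 21, 17,
          19, 17, 21, 20, 19,
          18, 12, 23, 20, 20]

n = len(scores)
k = round(math.sqrt(n))                    # Rule 1: K = sqrt(n)
score_range = max(scores) - min(scores)    # highest score - lowest score
raw_width = score_range / k                # Rule 2: range / number of classes

width = 3  # assumption: 2.2 rounded up to 3, as in the grouped table

# Build intervals of equal width that are exhaustive and mutually exclusive.
grouped = {}
for lower in range(min(scores), max(scores) + 1, width):
    upper = lower + width - 1
    grouped[(lower, upper)] = sum(lower <= s <= upper for s in scores)

print(f"K = {k}, range = {score_range}, raw width = {raw_width}")
for (lower, upper), f in grouped.items():
    print(f"{lower}-{upper}  f = {f:<3} P = {f / n:.2f}")
```

With width 3 only four intervals are needed to cover the scores 12 to 23, so the rules act as a starting point rather than a fixed recipe.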

SPSS: Grouped frequency distribution (1)

Cumulative frequency distributions (1)

Real lower limit = lower limit – 0.5

Real upper limit = upper limit + 0.5

Midpoint = (upper limit + lower limit) / 2

Interval   Real lower limit   Real upper limit   Midpoint   f    P      F
12-14      11.5               14.5               13         1    0.04   0.04
15-17      14.5               17.5               16         8    0.32   0.36
18-20      17.5               20.5               19         12   0.48   0.84
21-23      20.5               23.5               22         4    0.16   1.00
Total                                                       25   1.00

F = Cumulative Relative Frequency (CRF): the proportion for this interval plus all previous proportions.

NB. Also possible: cumulative absolute frequency

The cumulative relative frequency polygon graphs the proportion of respondents (and so the probability that a randomly chosen respondent) with a score of X or lower.
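
A short Python sketch of the cumulative columns, using the grouped frequencies from this lecture. The list names are illustrative assumptions; `itertools.accumulate` is used here only as a convenient way to form running totals.

```python
from itertools import accumulate

# Grouped frequency distribution from the lecture example.
intervals = ["12-14", "15-17", "18-20", "21-23"]
f = [1, 8, 12, 4]
n = sum(f)

p = [count / n for count in f]     # relative frequency P = f / n
cum_f = list(accumulate(f))        # cumulative absolute frequency
cum_p = [c / n for c in cum_f]     # F: cumulative relative frequency

print("Interval  f    P     cum f  F")
for label, fi, pi, cfi, cpi in zip(intervals, f, p, cum_f, cum_p):
    print(f"{label:<8}  {fi:<3}  {pi:.2f}  {cfi:<5}  {cpi:.2f}")
```

Reading the third row: F = 0.84 means that 84% of the children produced a score in the interval 18-20 or lower, that is, below the real upper limit of 20.5.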

Graphical displays: Nominal / Ordinal

[Figure: bar charts for raw data (categories 2 to 9) and grouped data (categories 2-3, 4-5, 6-7, 8-9)]

Graphical displays: Interval

Histograms

[Figure: histogram]

Stem & Leaf Display

Freq.   Stem & Leaf
1,00    Extremes (=<12,0)
2,00    15 . 00
2,00    16 . 00
4,00    17 . 0000
5,00    18 . 00000
2,00    19 . 00
5,00    20 . 00000
2,00    21 . 00
1,00    22 . 0
1,00    23 . 0

Stem width: 1

Each leaf: 1 case(s)

Histogram – symmetrical or skewed?

[Figure: example histograms: negatively skewed, symmetrical, positively skewed]

SPSS: Graphs – Chart Builder / Legacy Dialogs

SPSS: Graphs > Legacy Dialogs

SPSS - Graphs > Chart builder

Measures of central tendency

1. Mode (Mo) = most common score

2. Median (Mdn) = middle score (50th percentile)

3. Mean (M) = average

Median location = (n + 1) / 2
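
The three measures can be checked on the story-recall scores with Python's standard `statistics` module. This is a sketch, not the SPSS procedure used in the course; the (n + 1) / 2 position of the middle score is a common convention and is stated here as an assumption.

```python
import statistics

# Raw scores from the story-recall example (25 children).
scores = [18, 17, 16, 18, 15,
          15, 18, 16, 20, 18,
          22, 20, 17, 21, 17,
          19, 17, 21, 20, 19,
          18, 12, 23, 20, 20]

modes = statistics.multimode(scores)   # every score that shares the top count
median = statistics.median(scores)     # middle score of the ordered data
mean = statistics.mean(scores)         # sum of scores / n

# Position of the median in the ordered data (assumed convention).
ordered = sorted(scores)
median_location = (len(scores) + 1) / 2
middle_score = ordered[int(median_location) - 1]

print("Mode(s):", modes)
print("Median:", median, "at position", median_location)
print("Mean:", mean)
```

In this data set two scores (18 and 20) occur equally often, so the distribution has two modes while the median and mean are single values.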

Central tendency and skewness

[Figure: position of the median in a positively skewed, a symmetrical and a negatively skewed distribution]
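
A small numerical illustration of how skew shows up in the measures of central tendency. The data are the 15 values from the boxplot example in this lecture, which have a long right tail. That the mean is pulled further toward the tail than the median is a common pattern in skewed data, not a strict rule.

```python
import statistics

# Data from the boxplot example: a few large values form a right tail.
data = [3, 13, 17, 19, 22, 24, 25, 28, 35, 39, 44, 45, 83, 86, 93]

mean = statistics.mean(data)
median = statistics.median(data)

print("Mean:", mean)
print("Median:", median)
print("Mean above median (suggests positive skew):", mean > median)
```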

Measures of variability

1. Range (R) = Highest score – Lowest score

2. Interquartile range (IQR) = Q3 – Q1

3. Standard deviation (s or σ) = spread around the mean (square root of the variance)

4. Variance (s² or σ²) = spread around the mean (based on the squared deviations from the mean)

Variance and standard deviation

Score (x)   Deviation (x − x̄)   Squared deviation (x − x̄)²
…           …                    …
Sum         0                    ≥ 0

Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

The standard deviation and variance are:

- only suitable as measures of spread around the mean

- not robust against outliers
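
The score, deviation and squared-deviation columns can be sketched in Python for the story-recall scores. One assumption is made explicit: the sum of squared deviations is divided by n − 1 (the sample formula), which is also what `statistics.variance` and `statistics.stdev` do.

```python
import math
import statistics

# Raw scores from the story-recall example (25 children).
scores = [18, 17, 16, 18, 15,
          15, 18, 16, 20, 18,
          22, 20, 17, 21, 17,
          19, 17, 21, 20, 19,
          18, 12, 23, 20, 20]

n = len(scores)
mean = sum(scores) / n

deviations = [x - mean for x in scores]    # score minus mean
squared = [d ** 2 for d in deviations]     # squared deviations, never negative

sum_dev = sum(deviations)                  # zero, up to floating-point rounding
ss = sum(squared)                          # sum of squared deviations
variance = ss / (n - 1)                    # assumption: sample formula (n - 1)
sd = math.sqrt(variance)

print(f"Sum of deviations: {sum_dev:.10f}")
print(f"Sum of squared deviations: {ss:.2f}")
print(f"Variance: {variance:.3f}  (library: {statistics.variance(scores):.3f})")
print(f"Standard deviation: {sd:.3f}  (library: {statistics.stdev(scores):.3f})")
```

The deviations always cancel out to zero, which is why they are squared before being summed.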

Five-number summary and boxplot

Five-number summary consists of:

- Minimum = Lowest (non-outlying) score

- Q1 = 25th percentile (25% lower, 75% higher)

- Median (=Q2) = 50th percentile

- Q3 = 75th percentile

- Maximum = Highest (non-outlying) score

Graphical display: Boxplot

Boxplot - Example

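Data: 3 13 17 19 22 24 25 28 35 39 44 45 83 86 93

Numerical (five-number summary):

Min = 3

Q1 = 19

Mdn = 28

Q3 = 45

Max = 93

IQR = 45 – 19 = 26

Q1 – 1.5*IQR = 19 – 39 = -20

Q3 + 1.5*IQR = 45 + 39 = 84

86 and 93 lie above 84 and count as outliers, so the upper whisker of the boxplot ends at 83.

Graphical (boxplot):

[Figure: boxplot of the data]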

Rule of thumb

Outlier = observation that lies more than 1.5 x IQR above Q3 or below Q1.
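
The rule of thumb can be sketched in Python on the data of the boxplot example. Quartile conventions differ between programs; the `method="exclusive"` option of `statistics.quantiles` is chosen here because it gives Q1 = 19 and Q3 = 45 for these 15 values, and that choice is an assumption rather than the method SPSS necessarily uses.

```python
import statistics

# Data from the boxplot example.
data = [3, 13, 17, 19, 22, 24, 25, 28, 35, 39, 44, 45, 83, 86, 93]

# Quartiles; other conventions can give slightly different values.
q1, mdn, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Fences of the 1.5 x IQR rule of thumb.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

# Whiskers run to the lowest and highest non-outlying scores.
whisker_low, whisker_high = min(inside), max(inside)

print("Q1, Mdn, Q3:", q1, mdn, q3)
print("IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
print("Whiskers:", whisker_low, whisker_high)
```

The two highest values fall beyond the upper fence, so a boxplot draws them as separate points and ends the upper whisker at the highest remaining score.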

Overview

Scale of Measurement    Graphical                 CT       Spread
Nominal                 • Bar chart               Mode     ---
                        • (Pie chart)
Ordinal                 • Boxplot                 Median   Range
Interval (and higher)   • Histogram               Mean     - Standard dev.
                        • (Stem&Leaf display)              - Variance

What have you learned today?

What are the various ways to represent distributions numerically?

What are the various ways to represent distributions graphically?

How to describe a distribution

How to create and evaluate various numerical and graphical representations of distributions

How to determine what numerical and graphical representation is suitable for a variable.

Next week

No lecture and workgroups

Practice test on Blackboard

Enter your own data

Howell: Chapter 3

In two weeks

Normal distribution and standard scores
