18
Exploring data Rote analysis vs. snooping Spurious correlations There’s a whole website about this What can you do? The best you can Identify scientific questions Distinguish between exploratory and confirmatory analysis 1

Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

Exploring dataRote analysis vs. snooping

Spurious correlationsThere’s a whole website about this

What can you do?The best you can

• Identify scientific questions

• Distinguish between exploratory and confirmatory analysis

1

Page 2: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

• Pre-register studies when possible

• Keep an exploration and analysis journal

• Explore predictors and responses separately at first

1 Individual variables

• Look at location and shape

• Maybe with different sets of grouping variables

• Contrasts

– Parametric vs. non-parametric

– Exploratory vs. diagnostic

– Data vs. inference

Means and standard errors

●●

10

100

A B C D E F G H

treatment

decr

ease

Means and standard deviations

2

Page 3: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

●● ●

1

10

100

A B C D E F G H

treatment

decr

ease

Means and standard deviations

●●

● ●

0

50

100

A B C D E F G H

treatment

decr

ease

Bike example

3

Page 4: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

0

50

100

150

200

Clear Cloudy Light Heavy

weather

rent

als

Standard errors

−100

0

100

200

Clear Cloudy Light Heavy

weather

rent

als

Standard errors

4

Page 5: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

−100

0

100

200

Clear Cloudy Light Heavy

weather

rent

als

Standard deviations

−200

0

200

400

600

Clear Cloudy Light Heavy

weather

rent

als

Data shape

5

Page 6: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●●

●●

●●●

●●

●●

●●

●●●●

●●

●●

●●

●●●●

●●

●●

●●

●●●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

0

250

500

750

1000

Clear Cloudy Light Heavy

weather

rent

als

Data shape

0

250

500

750

1000

Clear Cloudy Light Heavy

weather

rent

als

Data shape

6

Page 7: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

●●

●●●●

●●●●●●

●●

●●●●

●●

●●●

●●

●●●

●●

●●●

●●●

●●●●

●●●●

●●

●●

●●●●

●●

●●●●●●●●

●●●

●●

●●

●●●●

●●●●

●●

●●●

●●●●●

●●●●

●●

●●

●●

●●●●

●●●●

●●●

●●●●

●●

●●●●●

●●

●●

●●●

●●

●●

●●

●●●

●●●

●●

●●

●●●●

●●●

●●●

●●●●

●●

●●●

●●●●●●

●●

●●●●●●●●

●●●

● ●●●●●

●●

●●

●●

●●

●●●

●●

●●

●●●●●●●●●

●●●●●●●●●●●●●

●●●●●

●●

●●●

●●

●●●

●●●

●●●●●●●●●

●●

●●

●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●

10

1000

Clear Cloudy Light Heavy

weather

rent

als

Data shape

10

1000

Clear Cloudy Light Heavy

weather

rent

als

Data shape and weight

7

Page 8: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●●

●●

●●●

●●

●●

●●

●●●●

●●

●●

●●

●●●●

●●

●●

●●

●●●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

0

250

500

750

1000

Clear Cloudy Light Heavy

weather

rent

als

Log scales

• In general:

– If your logged data span < 3 decades, use human-readable numbers (e.g., 10-5000kilotons per hectare)

– If not, just embrace “logs” (log10 particles per ul is from 3–8)

∗ But remember these are not physical values

• I love natural logs, but not as axis values

– Except to represent proportional difference!

2 Bivariate data

Banking

• Banking is a real thing

– Even though many examples are bogus

• Since the point is make patterns visually clear, trial-and-error is usually as good asalgorithm

– But it is worth considering

8

Page 9: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

Sunspots

sunspots data

Year

Mon

thly

sun

spot

num

bers

1750 1800 1850 1900 1950 2000

050

100

150

200

250

sunspots data

Year

Mon

thly

sun

spot

num

bers

1750 1800 1850 1900 1950

−40

0−

200

020

040

060

0

Smoking data

●●

1

2

3

4

5

6

current smoker non−current smoker

smoke

fev

Smoking data

9

Page 10: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

●●

●●

●●●

5

10

15

current smoker non−current smoker

smoke

age

Scatter plots

• Depending on how many data points you have, scatter plots may indicate relationshipsclearly

• They can often be improved with trend interpolations

– Interpolations may be particularly good for discrete responses (count or true-false)

Scatter plot

●●●

●●

●●

●●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

● ●

● ●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

1

2

3

4

5

6

5 10 15

age

fev

10

Page 11: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

Seeing the density better

1

2

3

4

5

6

5 10 15

age

fev

Seeing the density worse

1

2

3

4

5

6

5 10 15

age

fev

n1.001.251.501.752.00

Maybe fixed

11

Page 12: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

1

2

3

4

5

6

5 10 15

age

fev

n1.001.251.501.752.00

A loess trend line

1

2

3

4

5

6

5 10 15

age

fev

n1.001.251.501.752.00

Two loess trend lines

12

Page 13: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

1

2

3

4

5

6

5 10 15

age

fev

smokecurrent smokernon−current smoker

n1.001.251.501.752.00

Many loess trend lines

female male

5 10 15 5 10 15

1

2

3

4

5

6

age

fev

Theory of loess

• Local smoother (locally flat, linear or quadratic)

• Neighborhood size given by alpha

– Points in neighborhood are weighted by distance

• Check help function for loess

13

Page 14: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

Robust methods

• Loess is local, but not robust

– Uses least squares, can respond strongly to outliers

• R is has a very flexible function called rlm to do robust fitting

– Not local

– But can be combined with splines

Fitting comparison

1

2

3

4

5

6

5 10 15

age

fev

loess

1

2

3

4

5

6

5 10 15

age

fev

rlm plus ns(3)

Density plots

• Contours

– use _density_2d() to fit a two-dimensional kernel to the density

• hexes

– use geom_hex to plot densities using hexes

– this can also be done using rectangles for data with more discrete values

14

Page 15: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

Contours

1

2

3

4

4 8 12 16

age

fev

0.025

0.050

0.075

level

Contours

1

2

3

4

4 8 12 16

age

fev

0.025

0.050

0.075

level

Contours

15

Page 16: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

1

2

3

4

4 8 12 16

age

fev

0.025

0.050

0.075

level

Hexes

1

2

3

4

5

5 10 15

age

fev

sexfemalemale

Hexes

16

Page 17: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

2

4

6

5 10 15 20

age

fev

10

20

30count

Color principles

• Use clear gradients

• If zero has a physical meaning (like density), go in just one direction

– e.g., white to blue, white to red

– If the map contrasts with a background, zero should match the background

• If there’s a natural middle, you can use blue to white to red, or something similar

3 Multiple dimensions

• Three dimensional data is a lot like two-d with densities: contour plots are good

• Pairs plots: pairs, ggpairs

Pairs example

17

Page 18: Exploring data Rote analysis vs. snooping · 2019-09-08 · Exploring data Rote analysis vs. snooping Spurious correlations ... Data shape 0 250 500 750 1000 Clear Cloudy Light Heavy

●●

● ●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

● ●

●●

●● ●

●●

● ●

● ●●

●●

●●

●●

●● ●

●●

●●●● ●

●● ●● ● ●●

●● ●

●●●

●●

●●

●●

●● ●●●● ●● ●●

● ●●●●

●●●●●

●●

● ●●

●●

●●●

●●

●●

●●

● ●

●●

●●●

● ●●

●●●

●●

●●● ●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●● ●

●●

●●●

●●●●

●●● ●●

● ●

●●●●

●●●● ●

●● ●

●●●

●●

●● ●●

●● ●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●●●

●●● ●

●●

●●●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●

●●

●●

●●

Corr:

−0.118

●● ●● ●

●●●● ● ●●

●● ●

●●●

●●

●●

●●

● ●●●●● ● ●●●● ●●●

●●● ●●

●●

● ●●

●●

●●●

●●

●●

●●

●●

●●

●●●

● ●●

●●●

●●

● ●●●

●●

●●

●●

● ●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●● ●

●●●●

●●●

●●●

●●● ●●

●●

●●●●

●●●● ●

●● ●

●●●

●●

●● ●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

● ●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

Corr:

0.872

Corr:

−0.428

●●●●●

●●●●●●●

●●●

●●● ●●

●●

●●●●

●●●●●●

●●●●●

●●

●●●●

●● ●

●●

●●

●●

●●● ●

●●

●●●

●●●●

●●●●

●●

●●●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●

●●

●●

●●

Corr:

0.818

Corr:

−0.366

Corr:0.963

Sepal.Length Sepal.Width Petal.Length Petal.Width

Sepal.Length

Sepal.W

idthP

etal.LengthP

etal.Width

5 6 7 8 2.0 2.5 3.0 3.5 4.0 4.5 2 4 6 0.0 0.5 1.0 1.5 2.0 2.5

0.0

0.1

0.2

0.3

0.4

2.0

2.5

3.0

3.5

4.0

4.5

2

4

6

0.0

0.5

1.0

1.5

2.0

2.5

4 Multiple factors

• Use boxplots and violin plots

• Make use of facet_wrap and facetgrid

• Use different combinations (e.g., try plots with the same info, but different factors onthe axes vs. in the colors or the facets)

c©2018, Jonathan Dushoff and Ben Bolker. May be reproduced and distributed, with this notice, for non-commercial purposes only.

18