99
CS 109: Data Science Exploratory Data Analysis & Effective Visualizations Hanspeter Pfister pfi[email protected] Joe Blitzstein [email protected] Verena Kaynig [email protected]

03-EDA

Embed Size (px)

DESCRIPTION

EDA

Citation preview

Page 1: 03-EDA

CS 109: Data Science Exploratory Data Analysis& Effective Visualizations

Hanspeter [email protected]

Joe [email protected]

Verena [email protected]

Page 2: 03-EDA

This Week• HW0 - due today (not graded)

• HW1 - out today, due Th 9/24

Check syllabus for grading / late day / collaboration policies

• Sectioning - keep an eye on Piazza for information on how to indicate preferences

Page 4: 03-EDA

Ask an interesting question.

Get the data.

Explore the data.

Model the data.

Communicate and visualize the results.

What is the scientific goal?What would you do if you had all the data?What do you want to predict or estimate?

How were the data sampled?Which data are relevant?Are there privacy issues?

Plot the data.Are there anomalies?Are there patterns?

Build a model.Fit the model.

Validate the model.

What did we learn?Do the results make sense?

Can we tell a story?

Page 5: 03-EDA

Data ExplorationNot always sure what we are looking for (until we find it)

Page 6: 03-EDA

Example: Antibiotics Will Burtin, 1951

Page 7: 03-EDA

DataGenus, Species

Min. Inhibitory Concentration[ml/g]

+ -

Page 8: 03-EDA

What Questions?

Page 9: 03-EDA

M. Bostock, Protovisafter W. Burtin, 1951

How effective are the drugs?

If bacteria is gram positive,Penicillin & Neomycin are

most effective

If bacteria is gram negative,Neomycin is most effective

Gram Positive

Gram Negative

Page 10: 03-EDA

Wainer & Lysen, “That’s funny...”American Scientist, 2009

Adapted from Brian Schmotzer

Not a streptococcus!(realized ~30 years later)

Really a streptococcus!(realized ~20 years later)

How do the bacteria compare?

Page 11: 03-EDA

Wainer & Lysen, “That’s funny...”American Scientist, 2009

How do the bacteria compare?

Page 12: 03-EDA

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

John Tukey

Exploratory Data Analysis

Page 13: 03-EDA

VisualizationTo convey information through

graphical representations of data

Page 14: 03-EDA

Visualization GoalsCommunicate (Explanatory)

Present data and ideas

Explain and inform

Provide evidence and support

Influence and persuade

Analyze (Exploratory) Explore the data

Assess a situation

Determine how to proceed

Decide what to do

Page 15: 03-EDA

Communicate

New York Times

Page 17: 03-EDA

MizBee [Meyer  et  al.  2009]  

http://www.cs.utah.edu/~miriah/mizbee

Page 18: 03-EDA

Effective Visualizations

Page 19: 03-EDA

Not Effective...

Sources: US Treasury and WHO reports

Page 20: 03-EDA

http://viz.wtf

Page 21: 03-EDA

Effective Visualizations

1. Have graphical integrity

2. Keep it simple

3. Use the right display

4. Use color strategically

5. Tell a story with data

Page 22: 03-EDA

Graphical Integrity

Page 25: 03-EDA
Page 26: 03-EDA

Scale Distortions

Page 28: 03-EDA
Page 29: 03-EDA
Page 30: 03-EDA

Keep It Simple

Page 31: 03-EDA

Edward Tufte

Page 32: 03-EDA

Maximize Data-Ink Ratio

0-$24,999 $25,000+ 0-$24,999 $25,000+

Data ink

Total ink used in graphicData-Ink Ratio =

Page 33: 03-EDA

Maximize Data-Ink Ratio

0

175

350

525

700

Males Females

0-$24,999 $25,000+ 0-$24,999 $25,000+

Data ink

Total ink used in graphicData-Ink Ratio =

Page 34: 03-EDA

Why 3D pie charts are bad

Kevin Fox

Page 35: 03-EDA

Avoid Chartjunk

ongoing, Tim Brey

Extraneous visual elements that distract from the message

Page 36: 03-EDA

Avoid Chartjunk

ongoing, Tim Brey

Page 37: 03-EDA

Avoid Chartjunk

ongoing, Tim Brey

Page 38: 03-EDA

Avoid Chartjunk

ongoing, Tim Brey

Page 39: 03-EDA

Avoid Chartjunk

ongoing, Tim Brey

Page 40: 03-EDA

Don’t!

matplotlib gallery

Excel Charts Blog

Page 41: 03-EDA

Use The Right Display

Page 43: 03-EDA

Comparisons

Page 44: 03-EDA

Bar ChartHow Much Does Beer Consumption Vary by Country?

Bottles perperson per

week

Page 45: 03-EDA

Bars vs. Lines

Zacks 1999

Page 46: 03-EDA

Nathan Yau

Page 47: 03-EDA

Trends

Page 49: 03-EDA

Proportions

Page 50: 03-EDA

Pie Charts

Page 51: 03-EDA

eagerpies.com

Page 54: 03-EDA

Don’t!

Page 55: 03-EDA

Correlations

Page 56: 03-EDA

Scatterplots

http://xkcd.com/388/

Page 57: 03-EDA

Don’t!

matplot3d tutorial

Page 58: 03-EDA

Distributions

Page 59: 03-EDA

Histogram

ggplot2

Page 60: 03-EDA

Bin Width

binwidth = 0.1 binwidth = 0.01

ggplot2

Page 61: 03-EDA

Density Plots

Page 62: 03-EDA

2D Density Plots

Page 63: 03-EDA

Seaborn Tutorial

Page 64: 03-EDA

Design ExerciseHands-On Exercise

Page 65: 03-EDA

How do you feel about doing science?

Interest Before AfterExcitedKind of interestedOKNot greatBored 12

6143038

115

402519

Table

Data courtesy of Cole Nussbaumer

Page 66: 03-EDA
Page 67: 03-EDA
Page 68: 03-EDA
Page 69: 03-EDA
Page 70: 03-EDA
Page 71: 03-EDA

After the pilot program,

68%of kids expressed interest towards science,compared to 44% going into the program.

Page 72: 03-EDA

Perceptual Effectiveness

Page 73: 03-EDA

Stephen’s Power Law, 1961

J. Bertin, 1967

Cleveland / McGill, 1984

Heer / Bostock, 2010

J. Mackinlay, 1986

Page 74: 03-EDA

How much longer?

A

B 4x

Page 75: 03-EDA

How much steeper slope?

A B

4x

Page 76: 03-EDA

How much larger area?

A B

10x

Page 77: 03-EDA

How much darker?

A B

2x

Page 78: 03-EDA

How much bigger value?

A B

2 16

4x

Page 79: 03-EDA

Most Efficient

Least Efficient

C. MulbrandonVisualizingEconomics.com

Quantitative

Ordered

Categories

}}}

Page 80: 03-EDA

Most Effective

VisualizingEconomics.com

Page 81: 03-EDA

Less Effective

VisualizingEconomics.com

Page 82: 03-EDA

Pie vs. Bar Charts

Page 83: 03-EDA

Cliff Mass

Least Effective

Page 84: 03-EDA

Use Color Strategically

Page 85: 03-EDA

Color Discriminability

Sinha 2007

Page 86: 03-EDA

Colors for Categories

Do not use more than 5-8 colors at once

Ware, “Information Visualization”

Page 87: 03-EDA

Colors for Ordinal Data

Zeilis et al, 2009, “Escaping RGBland: Selecting Colors for Statistical Graphics”

Vary luminance and saturation

Page 88: 03-EDA

Colors for Quantitative Data

Rogowitz and Treinish, Why should engineers and scientists be worried about color?

Hue(Rainbow)

Luminance

Luminance & Hue

Page 89: 03-EDA

Rainbow Colormap

Page 91: 03-EDA

Avoid Rainbow Colors!

matplotlib gallery

Page 92: 03-EDA

Color Blindness

DeuteranopeProtanope Tritanope

Based on slide from Stone

Red / green deficiencies

Blue / Yellowdeficiency

Page 93: 03-EDA

Color Blindness

Normal Protanope Deuteranope Lightness

Based on slide from Stone

Page 94: 03-EDA

Color Brewer

Cynthia Brewer, Color Use Guidelines for Data Representation

Nominal

Ordinal

Page 96: 03-EDA

Effective Visualizations

1. Have graphical integrity

2. Keep it simple

3. Use the right display

4. Use color strategically

5. Tell a story with data

Page 97: 03-EDA

Further Reading

Page 98: 03-EDA

Edward Tufte

Page 99: 03-EDA

Stephen Few