26
ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations Research and Information Engineering Cornell October 1, 2020 1 / 21

ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

ORIE 4741: Learning with Big Messy Data

Exploratory Data Analysis

Professor Udell

Operations Research and Information EngineeringCornell

October 1, 2020

1 / 21

Page 2: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Announcements

I If you’re taking lecture async: remember to submitparticipation post after each class!

I Sections start Wednesday. They are optional, attend anyone you prefer.Section this week is a Julia tutorial.

I Office hours: Zoom links and times are posted on coursewebsite.

I Gradescope is open for submission of hw0, due Thursday9-9-19 9:30am.

I First quiz this week! It should occupy about 20 minutes;you’ll have up to half an hour to complete it. Start itanytime between 6:15pm Thursday and 9pm Friday.

(All times ET)

2 / 21

Page 3: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Questions from campuswire

I search for your question before posting new question

I approximate date of voter registration is fine

3 / 21

Page 4: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Why julia?

I the two language problem

I julia is fast (JIT-compiled)

I julia has pleasant syntax, esp for linear algebra(MATLAB-like, but more principled)

I julia supports efficient parallelism (including multithreading)

I the julia ecosystem

for this class: you can use any language you’d like (which yourTAs can read), but the course staff will only support Julia.

4 / 21

Page 5: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Topics to review

We will cover (most of) these in section, too:

I Linear algebra: invertible matrices, rank, norm, basic matrixidentities. When is a matrix invertible?

I QR factorization

I Gradients (multivariate derivative)

I Projections

I SVD

I Maximum likelihood estimation

I Union bound

I Computational complexity

5 / 21

Page 6: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Why look at the data?

I detect errors in data

I check assumptions

I select appropriate models

I understand relationships among the features

I understand relationships between features and labels

6 / 21

Page 7: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

How to look at the data?

I inspect raw data

I summary statistics

I visualize

7 / 21

Page 8: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

American community survey

2013 ACS:

I 3M respondents, 87 economic/demographic surveyquestionsI incomeI cost of utilities (water, gas, electric)I weeks worked per yearI hours worked per weekI home ownershipI looking for workI use foodstampsI education levelI state of residenceI . . .

I 1/3 of responses missing

find it at https://people.orie.cornell.edu/mru8/orie4741/data/acs_2013.csv

8 / 21

Page 9: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

How do computers work?

on a laptop:

I hard disk: usually ≤ 500 GB

I memory (RAM): usually ≤ 16 GB

I many programs (e.g., Excel): substantially more limited

don’t load a giant file into memory.your computer will crash.

how big is ACS data?3M respondents × 100 questions = 300M numbers ≈ 300MB

9 / 21

Page 10: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

How do computers work?

on a laptop:

I hard disk: usually ≤ 500 GB

I memory (RAM): usually ≤ 16 GB

I many programs (e.g., Excel): substantially more limited

don’t load a giant file into memory.your computer will crash.

how big is ACS data?3M respondents × 100 questions = 300M numbers ≈ 300MB

9 / 21

Page 11: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

How do computers work?

on a laptop:

I hard disk: usually ≤ 500 GB

I memory (RAM): usually ≤ 16 GB

I many programs (e.g., Excel): substantially more limited

don’t load a giant file into memory.your computer will crash.

how big is ACS data?3M respondents × 100 questions = 300M numbers ≈ 300MB

9 / 21

Page 12: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Inspect raw data

solution for large files: technology from the 70s!

bash shell:

I “how big are these files?”: ls -lh

I “show me some lines from the file”: head, tail, less

I “how many lines are in the file?”: wc -l

10 / 21

Page 13: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

American Community SurveyVariable Description Type

HHTYPE household type categoricalSTATEICP state categoricalOWNERSHP own home BooleanCOMMUSE commercial use BooleanACREHOUS house on ≥ 10 acres BooleanHHINCOME household income realCOSTELEC monthly electricity bill realCOSTWATR monthly water bill realCOSTGAS monthly gas bill realFOODSTMP on food stamps BooleanHCOVANY have health insurance BooleanSCHOOL currently in school BooleanEDUC highest level of education ordinalGRADEATT highest grade level attained ordinalEMPSTAT employment status categoricalLABFORCE in labor force BooleanCLASSWKR class of worker BooleanWKSWORK2 weeks worked per year ordinalUHRSWORK usual hours worked per week realLOOKING looking for work BooleanMIGRATE1 migration status categorical 11 / 21

Page 14: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Julia and Jupyter

I Julia is a programming language:it parses human-readable code to machine-readable code,executes it, returns the answer

I Jupyter is a protocol for interacting with a programminglanguage.

I Jupyter stores inputs and outputs as .ipynb files.I Jupyter notebooks display inputs and outputs in a browserI JuliaBox is an interface to a webserver running Julia

how to access?

I recommended: install anaconda distribution, then julia(go to section or see section materials for details)

I not recommended: use Google Colabhttps://discourse.julialang.org/t/

julia-on-google-colab-free-gpu-accelerated-shareable-notebooks/

15319

12 / 21

Page 15: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Julia and Jupyter

I Julia is a programming language:it parses human-readable code to machine-readable code,executes it, returns the answer

I Jupyter is a protocol for interacting with a programminglanguage.

I Jupyter stores inputs and outputs as .ipynb files.I Jupyter notebooks display inputs and outputs in a browserI JuliaBox is an interface to a webserver running Julia

how to access?

I recommended: install anaconda distribution, then julia(go to section or see section materials for details)

I not recommended: use Google Colabhttps://discourse.julialang.org/t/

julia-on-google-colab-free-gpu-accelerated-shareable-notebooks/

1531912 / 21

Page 16: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Summary statistics

univariate

I mean, median, mode

I max, min, range

I variance

I . . .

explore via Julia + Jupyter notebook

https:

//github.com/ORIE4741/demos/blob/master/eda.ipynb

multi- (but usualy just bi-)variate

I correlation, covariance

I . . .

13 / 21

Page 17: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Summary statistics

univariate

I mean, median, mode

I max, min, range

I variance

I . . .

explore via Julia + Jupyter notebook

https:

//github.com/ORIE4741/demos/blob/master/eda.ipynb

multi- (but usualy just bi-)variate

I correlation, covariance

I . . .13 / 21

Page 18: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

The perils of summary statistics

same mean, variance, correlation, line of best fit. . .

14 / 21

Page 19: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

The perils of summary statistics

same mean, variance, correlation, line of best fit. . .14 / 21

Page 20: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

The perils of summary statistics

15 / 21

Page 21: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

The perils of summary statistics: modern update

https:

//www.autodeskresearch.com/publications/samestats

16 / 21

Page 22: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

What to visualize?

I examples across all features (usually not)

I plot features across all examples (much more common)

17 / 21

Page 23: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Best practices

I Always label your axes.

I Ensure all marks on plot are meaningful.

I Beware of pie charts; bar charts are often easier to read.

I Beware of line plots; if your data is not continuous, tryscatter plot instead.

I Consider which curves to plot on same axes. Makecomparisons easy!

18 / 21

Page 24: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Beware of bad data

19 / 21

Page 25: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Take away

I always look at (some of) your data

I decide what question you want to answer

20 / 21

Page 26: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations

Questions?

https://docs.google.com/spreadsheets/d/

1vLbwi0WCOn0wU6cU_r0RHAnY7C0fDZ1F8Yq09pqYYuk

21 / 21