Lecture I: Data Science for Business Analytics
Thibault Vatter, Department of Statistics, Columbia University and HEC Lausanne, UNIL
12.04.2019

Outline

1 Course overview

2 Intro to data science

3 R

4 R workflow

5 Git

6 R markdown

A little about me

Born and raised in Geneva

Education:

  - B.Sc. Physics (EPFL, ’10)
  - M.Sc. Physics with minor in Financial Engineering (EPFL, ’12)
  - Ph.D. Statistics (HEC Lausanne, ’16)

Worked a bit as a quant in finance

Currently:

  - Assistant Professor in Statistics at Columbia University
  - Live in New York City

Hobbies:

  - Flying planes
  - Watching Bay Area teams (go 49ers and Warriors!)
  - Beers (formerly at Satellite, now in Brooklyn micro-breweries)

The basics

Lectures:

  - 04.12, 05.03, 05.10, 05.17 (+ 05.31 for the projects)
  - focus on introducing the concepts
  - 8:30-10:00am + 1:15-3:00pm/3:15-4:00pm
  - classroom 237, Internef building

Exercise sessions:

  - focus on the assignments and project
  - 10:15-12:00pm + 3:15-5:00pm
  - lab room 143, Internef building

TA: Iegor Rudnytskyi

Grading

3 assignments (35%) and one project (65%)

  - Detailed reports for each assignment and final project
  - Presentation during last lecture for the project

Final grade

  - According to

    $$\mathrm{GRADE} = \frac{\left(\sum_{i=1}^{3} \mathrm{HW}_i / 3\right) \cdot 35 + \mathrm{PR} \cdot 65}{100}$$

  - $\mathrm{HW}_i$ for $i \in \{1, 2, 3\}$ and $\mathrm{PR}$ are from 0 to 100
  - GRADE will then be adjusted to a scale from 1 to 6

Groups of 1 or 2 members

  - Email to Iegor with the group members
  - One email per group is enough
  - Deadline for group registration is April 19

Grades based on academic performance only!
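
The weighted average on this slide can be sketched in R. This is only an illustration: the function name `final_grade` and the example scores are made up, and the final adjustment to the 1 to 6 scale is left out because the slide does not say how it is done.

```r
# Hypothetical helper (not part of the course material): combines three
# homework scores and one project score, all on a 0-100 scale, using the
# 35% / 65% weights from the slide.
final_grade <- function(hw, pr) {
  stopifnot(length(hw) == 3, all(hw >= 0 & hw <= 100), pr >= 0, pr <= 100)
  (mean(hw) * 35 + pr * 65) / 100
}

# Made-up scores: homework average 90, project 70
final_grade(hw = c(80, 90, 100), pr = 70)  # 77
```

The result stays on the 0 to 100 scale, so a strong project moves the grade almost twice as much as the homework average does.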

Learning outcomes

Manage and analyze data

Develop data products

Use data science in a business context

source: r4ds.had.co.nz

Lectures

Date        Topic
04.12 (am)  Intro, R workflow, Git and R Markdown
04.12 (pm)  Wrangling (I) and Visualization (I)
05.03 (am)  Wrangling (II)
05.03 (pm)  Visualization (II)
05.10 (am)  Modeling (I)
05.10 (pm)  Modeling (II)
05.17 (am)  Interactivity, Dashboards and more
05.17 (pm)  Project coaching
05.31 (am)  Project presentations
05.31 (pm)  Project presentations

Milestones

Date   Assignment
05.05  Project proposal
05.09  HW1
05.16  HW2
05.19  Project update
05.23  HW3
05.30  Project report
05.30  Project presentations

To be submitted before midnight of the due date

No late submission without medical certificate

Course website

All lecture notes, the syllabus, assignments, and additional resources are available at:

https://tvatter.github.io/dsfba_2019/

Resources

R for data science

The CRAN website

RStudio cheat sheets

Much more in the resources section of the course website

Best place to look for answers?

2 Intro to data science

What is Data Science?

Wikipedia: “the extraction of knowledge from data”

The precise definition is somewhat unclear and contested...

Practitioners “agree” on the components of data science:

  - database management
  - gathering and cleaning
  - exploratory analysis
  - predictive modeling
  - data summary and visualization

Applications

Some of the hiring partners of The Data Incubator

E-marketing

Recommender systems

Sport analytics

Biotechnology

Image or speech recognition

Fraud and risk detection

Social media

Credit scoring

E-commerce

Government analysis

Gaming

Price comparisons

Airline route planning

Delivery logistics

Technology ecosystem

source: rosebt.com

Most popular?

source: kdnuggets.com

3 R

S and R

S

  - A statistical programming language
  - First appeared in 1976
  - Developed by John Chambers and (in earlier versions) Rick Becker and Allan Wilks of Bell Labs
  - John Chambers: “[the aim is] to turn ideas into software, quickly and faithfully”

R

  - Modern implementation of S
  - First appeared in 1993
  - Created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand
  - Currently developed by the R Development Core Team

Some “technical” details about R

Part of the GNU free software project

Source code written primarily in C, Fortran, and R

Available for Windows, macOS, and Linux

Multi-paradigm: object-oriented, functional, procedural

Dynamically typed

Scripting language (interpreted)

Wide variety of statistical and graphical techniques

Easily extensible through functions and packages

Read/write from/to various data sources
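
A few of these properties can be seen in a short R session. The snippet below is only an illustrative sketch: the variable names are arbitrary, and it uses the built-in `women` dataset so that it runs without extra packages.

```r
# Dynamically typed: the same name can hold values of different types
x <- 42L
class(x)          # "integer"
x <- "forty-two"
class(x)          # "character"

# Functional: functions are values and can be passed to other functions
squares <- sapply(1:5, function(n) n^2)
squares           # 1 4 9 16 25

# Object-oriented: generics such as summary() dispatch on the object's class
fit <- lm(weight ~ height, data = women)
class(fit)        # "lm"
fit_summary <- summary(fit)  # calls the method for "lm" objects
```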

What about Excel?

source: fantasyfootballanalytics.net

Excel is great for certain things...

source: github.com/jdwilson4

...but not everything

R’s advantages:

Easier automation

Better reproducibility

Faster computation

Supports larger data sets

Reads any type of data

More powerful data manipulation capabilities

Easier project organization

Easier to find and fix errors

Free & open source

Advanced statistics capabilities

State-of-the-art graphics

Runs on many platforms

Anyone can contribute packages to improve its functionality

Automation and reproducibility

source: trendct.org

How about Python?

source: python.org

CRAN

source: cran.r-project.org

RStudio

An open-source integrated development environment (IDE)

RStudio Desktop available for Windows, macOS, and Linux

source: rstudio.com

4 R workflow

Two questions

What about your analysis is “real”?

Where does your analysis “live”?

What about your analysis is “real”?

Where does your analysis live?

The console

R scripts

RStudio projects: make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents.

DEMO!

The workflow with RStudio projects

Create an RStudio project for each data analysis project.

Keep data files there.

Keep scripts there.

Save your outputs (plots and cleaned data) there.

Only ever use relative paths, not absolute paths.

Everything you need is in one place, and cleanly separated from all the other projects that you are working on.
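
The workflow above can be sketched in a few lines of R. This is a minimal illustration, not a prescribed layout: the folder names `data` and `output`, the file names, and the temporary directory standing in for the project root are all assumptions. In a real RStudio project, opening the `.Rproj` file already sets the working directory, so the `setwd()` calls would not be needed.

```r
# A temporary folder stands in for the project root (assumption for the demo)
project <- tempfile("project_")
dir.create(project)
old_wd <- setwd(project)

# Data and outputs live in sub-folders of the project
dir.create("data")
dir.create("output")

# Store a copy of a built-in dataset so there is a data file to read
write.csv(women, file.path("data", "women.csv"), row.names = FALSE)

# Relative paths only: the script works wherever the project folder is moved
women_raw <- read.csv(file.path("data", "women.csv"))

# Save a plot as an output of the analysis
pdf(file.path("output", "women_scatter.pdf"))
plot(weight ~ height, data = women_raw)
dev.off()

setwd(old_wd)  # restore the previous working directory
```

An absolute path would tie the script to one machine, whereas `file.path("data", "women.csv")` resolves correctly for anyone who opens the project.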

5 Git

Version control

Wikipedia: management of changes to documents, computer programs, large web sites, and other collections of information

[Figure: Google Trends popularity (0 to 100) by month, 2005 to 2015, for Apache Subversion, Concurrent Versions System, git, Mercurial, and Perforce]

Git: “the stupid content tracker”

Created by Linus Torvalds in 2005, with these criteria:

  - Patching should take < 3 seconds
  - CVS as an example of what not to do (if in doubt, do the opposite)
  - Distributed workflow
  - Strong safeguards against corruption (accidental or malicious)

Maintained by Junio Hamano since 2005

Free software released under the GNU General Public License (GPL), version 2

Source code written primarily in C, Shell, Perl, Tcl, Python

Available for Windows, macOS, and Linux

Centralized vs distributed workflows

Version control concepts and best practices by Michael Ernst

GitHub

Web-based version control service using git

Bug tracking, feature requests, task management, and wikis for every project

20 million users and 57 million repositories (April 2017)

Private and public repos

GitHub Student Developer Pack

Create an account and email your username to Iegor by April 19.

RStudio projects with version control

DEMO!

Work on your assignment

Commit changes to your local repository

Push the changes to the GitHub repo

source: github.com/datasciencelabs
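
The three steps can also be run outside the RStudio interface. The sketch below drives the git command-line client from R with `system2()`, and makes several assumptions: git is installed, a local bare repository stands in for the GitHub repo, and the wrapper `git()`, the user name and email, and the file name `hw1.R` are all hypothetical.

```r
# Thin wrapper around the git command-line client (assumed to be installed).
# The user name and email are placeholders supplied only for this demo.
git <- function(dir, ...) {
  args <- c("-C", shQuote(dir),
            "-c", "user.name=Student",
            "-c", "user.email=student@example.com",
            ...)
  system2("git", args, stdout = TRUE, stderr = FALSE)
}

remote <- tempfile("remote_")  # stands in for the GitHub repo
local  <- tempfile("local_")   # your working copy
dir.create(remote)
dir.create(local)
git(remote, "init", "--bare")
git(local, "init")
git(local, "remote", "add", "origin", shQuote(remote))

# 1. Work on your assignment
writeLines("x <- 1", file.path(local, "hw1.R"))

# 2. Commit changes to your local repository
git(local, "add", "hw1.R")
git(local, "commit", "-m", shQuote("Add first draft of HW1"))

# 3. Push the changes to the remote
git(local, "push", "origin", "HEAD")

# The same commit should now appear on both sides
log_local  <- git(local, "log", "--oneline")
log_remote <- git(remote, "log", "--oneline", "--all")
```

Until step 3 runs, the commit exists only in the local repository, so comparing the two logs is a quick check that the push went through.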

Best practices

Use a descriptive commit message

Make each commit a logical unit

Avoid indiscriminate commits

Incorporate others’ changes frequently

Share your changes frequently

Coordinate with your co-workers

Remember that the tools are line-based

Don’t commit generated files

Understand your merge tool

See “Version control concepts and best practices” by Michael Ernst

Final comments

Last commit before midnight of due date as final submission

  - If there are commits after midnight, we will take the last commit up to the due date at 11:59 pm as the final version

Check that the final commit is showing in your GitHub repo

  - “I forgot to push” is not an acceptable excuse

Detailed tutorials (with lots of pictures):

  - The best
  - Setting-up GitHub
  - Git and RStudio
  - GitHub and RStudio (alternative)
  - GitHub and RStudio (alternative 2)

6 R markdown

R Markdown

The two components:

Literate programming

Markdown

Literate programming

Motivation: helps peers understand and replicate your results, find errors and suggest enhancements

Introduced by Donald Knuth

Wikipedia: a program is given as an explanation of the program logic in a natural language, such as English, interspersed with snippets of macros and traditional source code, from which a compilable source code can be generated [... It] represents a move away from writing programs in the manner and order imposed by the computer, and instead enables programmers to develop programs in the order demanded by the logic and flow of their thoughts

What does this R code do?

    data(women)
    plot(women)
    fit <- lm(weight ~ height, data = women)
    abline(fit)

And this one?

    # Analysis of the 'women' dataset in R
    data(women)                 # Load the data
    attach(women)               # Attach data to the search path
    plot(weight ~ height)       # Make a scatter plot
    fit <- lm(weight ~ height)  # Fit linear model
    abline(fit)                 # Add a line of best fit to the plot

Two competing “views”

Real programmers don’t comment their code. If it was hard to write, it should be hard to understand. – unknown

OR

If you can’t write clearly, you probably don’t think nearly as well as you think you do. – Kurt Vonnegut

Can’t we do better?

The World Almanac and Book of Facts (1975) includes a dataset of heights (in) and weights (lbs) of 15 American women aged 30–39. It is built into R:

    data(women)

As height increases, weight appears to increase (almost) linearly: every inch in height adds approximately 3.45 lbs. This was determined by fitting a simple linear regression model of weight against height:

    fit <- lm(weight ~ height, data = women)

Can’t we do better? cont’d

The resulting least-squares regression line can be drawn on a scatter plot of weight against height, where the model seems appropriate:

    plot(weight ~ height, data = women)
    abline(fit)
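
The figure of 3.45 lbs per inch quoted in the text can be checked directly from the fitted model. The short sketch below does that, and also looks at the share of variance explained as one rough indication that a straight line is a reasonable description; the variable names `slope` and `r_squared` are chosen here for illustration.

```r
data(women)
fit <- lm(weight ~ height, data = women)

# Estimated change in weight (lbs) per additional inch of height
slope <- unname(coef(fit)["height"])
round(slope, 2)  # 3.45

# Proportion of the variance in weight explained by height
r_squared <- summary(fit)$r.squared
```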

What is Markdown?

A lightweight markup language

Markup:

  - A system for annotating a document in a way that is syntactically distinguishable from the text
  - E.g., LaTeX and HyperText Markup Language (HTML)

Lightweight:

  - A markup language with simple, unobtrusive syntax
  - E.g., Markdown and R markdown

Markup vs lightweight markup

Here is some text:

  - in italics,
  - in boldface.

In LaTeX:

    Here is some text:
    \begin{itemize}
      \item in \textit{italics},
      \item in \textbf{boldface}.
    \end{itemize}

In Markdown:

    Here is some text:
    * in *italics*,
    * in **boldface**.

What is R markdown?

A markdown-based literate programming system

DEMO!
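
To give a rough idea of what such a document looks like, the sketch below writes a minimal R Markdown file from R and renders it when the tools are available. The file name `women.Rmd`, its title, and its content are made up for illustration; `rmarkdown::render()` is only called if the rmarkdown package and pandoc are installed.

```r
fence <- strrep("`", 3)  # three backticks delimit a code chunk

# A minimal document: YAML header, some prose, and one R code chunk
rmd <- c(
  "---",
  "title: The women dataset",
  "output: html_document",
  "---",
  "",
  "Weight increases almost linearly with height:",
  "",
  paste0(fence, "{r}"),
  "fit <- lm(weight ~ height, data = women)",
  "coef(fit)",
  fence
)

path <- file.path(tempdir(), "women.Rmd")
writeLines(rmd, path)

# Render to HTML only when the required tools are present
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(path, quiet = TRUE)
}
```

When rendered, the prose is formatted as Markdown and the chunk is executed, so the output shows the code together with its results.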

Useful resources

Essential: R Markdown cheat sheet

RStudio’s R markdown website

  - Tutorial (to get you started)
  - Output formats (e.g., HTML, Word documents, PDFs, presentations, etc.)

Material written by Yihui Xie

  - knitr and especially its options page
  - bookdown to write technical reports
  - blogdown to even build your own website
