Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Lecture IData Science for Business Analytics
Thibault Vatter <[email protected]>Department of Statistics, Columbia University and HEC Lausanne, UNIL 12.04.2019
Outline
1 Course overview
2 Intro to data science
3 R
4 R workflow
5 Git
6 R markdown
T. Vatter Lecture I 12.04.2019 1 / 50
A little about me
Born and raised in Geneva
Education:
I B.Sc. Physics (EPFL, ’10)I M.Sc. Physics with minor in Financial Engineering (EPFL, ’12)I Ph.D. Statistics (HEC Lausanne, ’16)
Worked a bit as a quant in finance
Currently:
I Assistant Professor in Statistics at Columbia UniversityI Live in New York city
Hobbies:
I Flying planesI Watching bay area teams (go 49ers and Warriors!)I Beers (formerly at Satellite, now in Brooklyn micro-breweries)
T. Vatter Lecture I 12.04.2019 2 / 50
The basics
Lectures:
I 04.12, 05.03, 05.10, 05.17 (+ 05.31 for the projects)
I focus on introducing the concepts
I 8:30-10:00am + 1:15-3:00pm/3:15-4:00pm
I classroom 237, Internef building
Exercise sessions:
I focus on the assignments and project
I 10:15-12:00pm + 3:15-5:00pm
I lab room 143, Internef building
TA: Iegor Rudnytskyi, [email protected]
T. Vatter Lecture I 12.04.2019 3 / 50
Grading
3 assignments (35%) and one project (65%)I Detailed reports for each assignment and final projectI Presentation during last lecture for the project
Final gradeI According to
GRADE =
∑3i=1
HWi
3 · 35 + PR · 65
100
I HWi for i = {1, 2, 3, 3} and PR are from 0 to 100I GRADE will then be adjusted from 1 to 6
Groups of 1 or 2 membersI Email to Iegor with the group membersI One email per group is enoughI Deadline for group registration is April 19
Grades based on academic performance only!
T. Vatter Lecture I 12.04.2019 4 / 50
Learning outcomes
Manage and analyze data
Develop data products
Use data science in a business context
source: r4ds.had.co.nz
T. Vatter Lecture I 12.04.2019 5 / 50
Lectures
Date Topic
04.12 (am) Intro, R workflow, Git and RMarkdown04.12 (am) Wrangling (I) and Visualization (I)05.03 (am) Wrangling (II)05.03 (pm) Visualization (II)05.10 (am) Modeling (I)05.10 (pm) Modeling (II)05.17 (am) Interactivity, Dashboards and more05.17 (pm) Project coaching05.31 (am) Projects presentations05.31 (pm) Projects presentations
T. Vatter Lecture I 12.04.2019 6 / 50
Milestones
Date Assignment
05.05 Project proposal05.09 HW105.16 HW205.19 Project update05.23 HW305.30 Project report05.30 Project presentations
To be submitted before midnight of the due date
No late submission without medical certificate
T. Vatter Lecture I 12.04.2019 7 / 50
Course website
All lecture notes, the syllabus, assignments, and additional
resources are available at:
https://tvatter.github.io/dsfba 2019/
T. Vatter Lecture I 12.04.2019 8 / 50
Resources
R for data science
The CRAN website
Rstudio cheat sheets
Much more in the resourcessection of the course website
T. Vatter Lecture I 12.04.2019 9 / 50
Best place to look for answers?
T. Vatter Lecture I 12.04.2019 10 / 50
Outline
1 Course overview
2 Intro to data science
3 R
4 R workflow
5 Git
6 R markdown
T. Vatter Lecture I 12.04.2019 11 / 50
What is Data Science?
Wikipedia: “the extractionof knowledge from data”
Precise definition a bitunclear and controversed...
Practitioners “agree” on thecomponents of data science:
I database management
I gathering and cleaning
I exploratory analysis
I predictive modeling
I data summary andvisualization
T. Vatter Lecture I 12.04.2019 12 / 50
Applications
Some of the hiring partners of The Data Incubator
E-marketing
Recommender systems
Sport analytics
Biotechnology
Image or speech recognition
Fraud and risk detection
Social media
Credit scoring
E-commerce
Government analysis
Gaming
Price comparisons
Airline routes planing
Delivery logisticsT. Vatter Lecture I 12.04.2019 13 / 50
Technology ecosystem
source: rosebt.com
T. Vatter Lecture I 12.04.2019 14 / 50
Most popular?
source: kdnuggets.comT. Vatter Lecture I 12.04.2019 15 / 50
Outline
1 Course overview
2 Intro to data science
3 R
4 R workflow
5 Git
6 R markdown
T. Vatter Lecture I 12.04.2019 16 / 50
S and R
S
I A statistical programming language
I First appeared in 1976
I Developed by John Chambers and (in earlier versions) RickBecker and Allan Wilks of Bell Labs
I John Chambers, [the aim is] to turn ideas into software,quickly and faithfully
R
I Modern implementation of S
I First appeared in 1993
I Created by Ross Ihaka and Robert Gentleman at the Universityof Auckland, New Zealand
I Currently developed by the R Development Core Team
T. Vatter Lecture I 12.04.2019 17 / 50
Some “technical” details about R
Part of the GNU free software project
Source code written primarily in C, Fortran, and R
Available for Windows, macOS, and Linux
Multi-paradigm: object-oriented, functional, procedural
Dynamically typed
Scripting language (interpreted)
Wide variety of statistical and graphical techniques
Easily extensible through functions and packages
Read/write from/to various data sources
T. Vatter Lecture I 12.04.2019 18 / 50
What about Excel?
source: fantasyfootballanalytics.net
T. Vatter Lecture I 12.04.2019 19 / 50
Excel is great for certain things...
source: github.com/jdwilson4
T. Vatter Lecture I 12.04.2019 20 / 50
...but not everything
R’s advantages:Easier automation
Better reproducibility
Faster computation
Supports larger data sets
Reads any type of data
More powerful datamanipulation capabilities
Easier project organization
Easier to find and fix errors
Free & open source
Advanced statisticscapabilities
State-of-the-art graphics
Runs on many platforms
Anyone can contributepackages to improve itsfunctionality
T. Vatter Lecture I 12.04.2019 21 / 50
Automation and reproducibility
source: trendct.org
T. Vatter Lecture I 12.04.2019 22 / 50
How about Python?
source: python.org
T. Vatter Lecture I 12.04.2019 23 / 50
RStudio
An open-source integrated development environment (IDE)
RStudio Desktop available for Windows, macOS, and Linux
source: rstudio.com
T. Vatter Lecture I 12.04.2019 25 / 50
Outline
1 Course overview
2 Intro to data science
3 R
4 R workflow
5 Git
6 R markdown
T. Vatter Lecture I 12.04.2019 26 / 50
Two questions
What about your analysis is “real”?
Where does your analysis “live”?
T. Vatter Lecture I 12.04.2019 27 / 50
What about your analysis is “real”?
T. Vatter Lecture I 12.04.2019 28 / 50
Where does your analysis live?
The console
R scripts
RStudio projects: make it straightforward to divide yourwork into multiple contexts, each with their own workingdirectory, workspace, history, and source documents.
DEMO!
T. Vatter Lecture I 12.04.2019 29 / 50
The workflow with RStudio projects
Create an RStudio project for each data analysis project.
Keep data files there.
Keep scripts there.
Save your outputs (plots and cleaned data) there.
Only ever use relative paths, not absolute paths.
Everything you need is in one place, and cleanly separatedfrom all the other projects that you are working on.
T. Vatter Lecture I 12.04.2019 30 / 50
Outline
1 Course overview
2 Intro to data science
3 R
4 R workflow
5 Git
6 R markdown
T. Vatter Lecture I 12.04.2019 31 / 50
Version control
wikipedia: management of changes to documents, computerprograms, large web sites, and other collections of information
0
25
50
75
100
2005 2010 2015
Month
Pop
ular
ity
System
Apache Subversion
Concurrent Versions System
git
Mercurial
Perforce
Google Trends Popularity
T. Vatter Lecture I 12.04.2019 32 / 50
Git: “the stupid content tracker”
Created by Linus Torvalds in 2005, his criteria:
I Patching should take < 3 seconds
I CVS as an ex. of what not to do (in doubt, do the opposite)
I Distributed workflow
I Strong safeguards against corruption (accidental or malicious)
Maintained by Junio Hamano since 2005
Part of the GNU free software project
Source code written primarily in C, Shell, Perl, Tcl, Python
Available for Windows, macOS, and Linux
T. Vatter Lecture I 12.04.2019 33 / 50
Centralized vs distributed workflows
Version control concepts and best practices by Michael Ernst
T. Vatter Lecture I 12.04.2019 34 / 50
Github
Web-based version controlservice using git
Bug tracking, feature requests,task management, and wikis forevery project
20 million users and 57 millionrepositories (April 2017)
Private and public repos
GitHub Student Developer Pack
Create account & email user toIegor by April 19.
T. Vatter Lecture I 12.04.2019 35 / 50
RStudio projects with version control
DEMO!
Work on your assignment
Commit changes to your local repository
Push the changes to the github repo
source: github.com/datasciencelabs
T. Vatter Lecture I 12.04.2019 36 / 50
Best practices
Use a descriptive commit message
Make each commit a logical unit
Avoid indiscriminate commits
Incorporate others’ changes frequently
Share your changes frequently
Coordinate with your co-workers
Remember that the tools are line-based
Don’t commit generated files
Understand your merge tool
See “Version control concepts and best practices” by Michael Ernst
T. Vatter Lecture I 12.04.2019 37 / 50
Final comments
Last commit before midnight of due date as final submission
I If there are commits after midnight, we will take the lastcommit up to the due date at 11:59 pm as the final version
Check that the final commit is showing in your Github repo
I “I forgot to push” is not an acceptable excuse
Detailed tutorials (with lots of pictures):
I The best
I Setting-up Github
I Git and RStudio
I Github and RStudio (alternative)
I Github and RStudio (alternative 2)
T. Vatter Lecture I 12.04.2019 38 / 50
Outline
1 Course overview
2 Intro to data science
3 R
4 R workflow
5 Git
6 R markdown
T. Vatter Lecture I 12.04.2019 39 / 50
R Markdown
The two components:
Literate programming
Markdown
T. Vatter Lecture I 12.04.2019 40 / 50
Literate programming
Motivation: helps peers understand and replicate your results,find errors and suggest enhancements
Introduced by Donald Knuth
wikipedia: a program is given as an explanation of theprogram logic in a natural language, such as English,interspersed with snippets of macros and traditional sourcecode, from which a compilable source code can be generated[... It] represents a move away from writing programs in themanner and order imposed by the computer, and insteadenables programmers to develop programs in the orderdemanded by the logic and flow of their thoughts
T. Vatter Lecture I 12.04.2019 41 / 50
What does this R code do?
1 data (women)2 p l o t (women)3 f i t ← lm ( we ight ∼ he igh t , data = women)4 a b l i n e ( f i t )
T. Vatter Lecture I 12.04.2019 42 / 50
And this one?
1 # Ana l y s i s o f the ’women ’ d a t a s e t i n R2 data (women) # Load the data3 a t t a ch (women) # Attach data to path4 p l o t ( we ight ∼ h e i g h t ) # Make a s c a t t e r p l o t5 f i t ← lm ( we ight ∼ h e i g h t ) # F i t l i n e a r model6 a b l i n e ( f i t ) # Add a l i n e o f b e s t f i t to the p l o t
T. Vatter Lecture I 12.04.2019 43 / 50
Two competing “views”
Real programmers don’t comment their code. If it was hard towrite, it should be hard to understand. – unknown
ORIf you can’t write clearly, you probably don’t think nearly as well as
you think you do. – Kurt Vonnegut
T. Vatter Lecture I 12.04.2019 44 / 50
Can’t we do better?
The World Almanac and Book of Facts (1975) includes a datasetof heights (in) and weights (lbs) of 15 American women aged30–39. It is built into R:
1 data (women)
As height increases, weight appears to increases (almost) linearly:every inch in height adds approximately 3.45 lbs. This wasdetermined by fitting a simple linear regression model of weightagainst height:
1 f i t ← lm ( we ight ∼ he igh t , data = women)
T. Vatter Lecture I 12.04.2019 45 / 50
Can’t we do better? cont’d
The resulting least-squares regression line can be drawn on ascatter plot of height against weight, where the models seemsappropriate:
1 p l o t ( we ight ∼ he igh t , data = women)2 a b l i n e ( f i t )
T. Vatter Lecture I 12.04.2019 46 / 50
What is Markdown?
A lightweight markup language
Markup:
I A system for annotating a document in a way that issyntactically distinguishable from the text
I E.g., LaTeX and HyperText Markup Language (HTML)
Lightweight:
I A markup language with simple, unobtrusive syntax
I E.g., Markdown and R markdown
T. Vatter Lecture I 12.04.2019 47 / 50
Markup vs lightweight markup
Here is some text:
in italics,
in boldface.
In LaTeX:
1 Here i s some t e x t :2 \ beg in { i t em i z e }3 \ i t em i n \ t e x t i t { i t a l i c s } ,4 \ i t em i n \ t e x t b f { bo l d f a c e } .5 \end{ i t em i z e }
In Markdown:
1 Here i s some t e x t :2 ∗ i n ∗ i t a l i c s ∗ ,3 ∗ i n ∗∗ bo l d f a c e ∗∗ .
T. Vatter Lecture I 12.04.2019 48 / 50
What is R markdown?
A markdown-based literateprogramming system
DEMO!
T. Vatter Lecture I 12.04.2019 49 / 50
Useful resources
Essential: R Markdown cheat sheet
RStudio’s R markdown website
I Tutorial (to get you started)
I Output formats (e.g., HTML, Word documents, PDFs,presentations, etc.)
stuff written by Yihui
I knitr and especially its options page
I bookdown to write technical reports
I blogdown to even build your own website
T. Vatter Lecture I 12.04.2019 50 / 50