Effective R: Tools, tips, and tricks for
being a more effective data scientist
102 Wurster Hall
UC Berkeley
21 October 2014
Polls
How many of you are students?
How many have
> 1 year using R?
> 3 years?
> 5 years?
How many use R as your principal data science tool?
How many use Python
Julia
SAS or SPSS
Spark/Scala
Java
Ever spend too much time debating which technology fits?
11/4/2014 2 KNOW YOUR DATA
Given a vector of numbers (x),
write a function (f) that returns a vector in which each element is the product of all the other elements of x, i.e. every element except the one at that index.
Example:
> x <- c( 1, 5, 2, 8 )
> f(x)
[1] 80 16 40 10
# 5*2*8, 1*2*8, 1*5*8, 1*2*5
Decision Patterns
Founded 2010
Bring together complementary skills for managing data:
Acquisition * Organization * Storage * Access * Utilization
Our Model
Service Consulting
Accept no VC funding
Use consulting margins to build niche products
Our Customers
Financial Services, Retail, Entertainment, Food, Communications, Defense, Environmental.
We get to work on a
• variety of problems,
• with a variety of technologies
• in a variety of fields
We have to work on a
• variety of problems,
• with a variety of technologies
• in a variety of fields
DATA SCIENTIST OUTLOOK
Source: http://venturebeat.com/2013/11/11/data-scientists-needed/
COMPETITION
Much of the work will not be done
by traditional workers
INNOVATION
Spoils go to those who make products
from repeatable processes
Google Prediction API
The price for analytics is falling …
Criterion | Python | R
Paradigm | typed, interpreted, OO | typed, interpreted, functional, vectorized
Popularity (Tiobe) | 8th rank, -0.77% | 15th rank, +0.97%
Packages | PyPI: 50,321 packages, 35+ updates/day | CRAN: 5,975 packages, 15 updates/day
(unlabeled) | 351,510 | 70,652
Development Tools | Spyder, IPython, Eclipse | RStudio, Eclipse
Data Packages
Major ML packages
Given a vector of numbers (x),
write a function (f) that returns a vector in which each element is the product of all the other elements of x, i.e. every element except the one at that index.
Example:
> x <- c( 1, 5, 2, 8 )
> f(x)
[1] 80 16 40 10
# 5*2*8, 1*2*8, 1*5*8, 1*2*5
Solution:
f <- function(x) prod(x) / x
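The one-liner above divides by each element, so it returns NaN wherever x contains a zero. A minimal sketch of a division-free variant follows; the name f_safe and the prefix/suffix approach are assumptions for illustration, not part of the original slides.

```r
# Product of all other elements without dividing, so zeros are handled.
# f_safe is a hypothetical name chosen for this sketch.
f_safe <- function(x) {
  n <- length(x)
  if (n == 0) return(numeric(0))
  # prefix[i] = product of x[1..i-1]; suffix[i] = product of x[i+1..n]
  prefix <- c(1, cumprod(x)[-n])
  suffix <- rev(c(1, cumprod(rev(x))[-n]))
  prefix * suffix
}

f_safe(c(1, 5, 2, 8))   # 80 16 40 10, same as prod(x) / x
f_safe(c(1, 0, 2, 8))   # 0 16 0 0; prod(x) / x gives NaN at the zero
```

For vectors without zeros the one-liner is shorter and just as correct; the longer form only matters when zeros can occur.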
Learning Curve
Easier, esp. if coming from an OO background
Steeper. More, dedicated functions
Code Maintainability
Better package system, fewer name clashes
Better documentation
Generally less code req’d
Performance
Higher, extensible through Cython, C, C++
Rcpp
Code expressiveness
Hack to extend operators
Lazy evaluation
%x% syntax used widely
Non-standard evaluation
Dedicated Web Frameworks
Translucent Shiny
Feature completeness
Rmarkdown, Reproducible Research, ProjectTemplate
Vendor Entrenchment
Windows Azure, Oracle, MicroStrategy, Birst, Tableau
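The expressiveness items listed above (%x% operators, lazy evaluation, non-standard evaluation) can be illustrated in base R. This is a small sketch; the names `%+%`, first_only, and show_expr are made up for the example.

```r
# A user-defined infix operator: any function named %something% can be
# called between its two arguments. `%+%` is a hypothetical name.
`%+%` <- function(a, b) paste(a, b)
"effective" %+% "R"                      # "effective R"

# Lazy evaluation: an argument is only evaluated if the function uses it.
first_only <- function(a, b) a
first_only(1, stop("never evaluated"))   # returns 1, no error

# Non-standard evaluation: capture the expression instead of its value.
show_expr <- function(expr) deparse(substitute(expr))
show_expr(x + y * 2)                     # "x + y * 2"
```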
BREAKDOWN OF CODE TO
Data Management
Statistical Operations
Visualization
Presentation
Formatting
Delivery
Other (Misc)
What about …
SECRETS TO LEET CODING
Adopt standards. Cf. Python’s:
PEP-8 Naming and Formatting
PEP-257 Documentation
PEP-20 Readability
We do not follow Google’s coding convention
Use version control:
GitHub, Bitbucket, GitLab
Best GUI: Atlassian Sourcetree
Use Agile Methods
Track issues: JIRA, GitHub, GitLab
Commit early and often.
Good PM is worth every penny.
SECRETS TO LEET CODING 2
Follow Established Development Patterns
Goal | Description | R Packages
Ad hoc analysis | Create a process | ProjectTemplate, Rmarkdown
Package Development | Create a package | RStudio, roxygen2, devtools
Application: Interactive | Web application | JavaScript, Shiny, OpenCPU
Application: Automated | Code to be scheduled or called as an event | Rscript (R -e), optigrab, crontab
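For the automated case, a script run through Rscript might look roughly like the sketch below. It uses only base R's commandArgs rather than optigrab, and the file name report.R, the arguments, and the helper names are all hypothetical.

```r
#!/usr/bin/env Rscript
# Hypothetical script: report.R. Run as: Rscript report.R input.csv 10

# Turn the trailing command-line arguments into a named list.
parse_args <- function(args) {
  if (length(args) < 1) stop("usage: Rscript report.R <file> [n]")
  list(file = args[1],
       n    = if (length(args) >= 2) as.integer(args[2]) else 5L)
}

main <- function(args = commandArgs(trailingOnly = TRUE)) {
  opts <- parse_args(args)
  dat  <- read.csv(opts$file)
  print(head(dat, opts$n))
}

# Only run when called non-interactively with arguments (Rscript, cron).
if (!interactive() && length(commandArgs(trailingOnly = TRUE)) > 0) main()
```

A crontab entry such as `0 6 * * * Rscript /path/to/report.R /path/to/input.csv` (paths are placeholders) would then schedule it daily at 06:00.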
Creativity is generally a bad thing
R’S NOBELS
DATA.TABLE
Munging and data management
https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping
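As a rough illustration of the munging style data.table encourages, here is a minimal grouping sketch. The toy data and column names are invented for the example, and it assumes the data.table package is installed.

```r
library(data.table)

# Toy data (made up for illustration).
DT <- data.table(
  store = c("a", "a", "b", "b", "b"),
  sales = c(10, 20, 5, 15, 25)
)

# DT[i, j, by]: filter rows in i, compute in j, group with by.
totals <- DT[sales > 5, .(total = sum(sales), n = .N), by = store]
print(totals)

# := adds or updates a column by reference, without copying the table.
DT[, share := sales / sum(sales), by = store]
```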
MAGRITTR
Code readability, interactive programming
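The readability gain comes from the pipe operator, which lets a chain of calls read left to right. A small sketch, assuming the magrittr package is installed:

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested calls read inside-out ...
nested <- round(mean(sqrt(x)), 1)

# ... while the pipe reads left to right: x %>% f(y) is f(x, y).
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)   # TRUE
```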
FOREACH, ITERTOOLS
Scaling-out
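A minimal foreach sketch, assuming the foreach package is installed. It runs sequentially with %do%, so no parallel backend is needed.

```r
library(foreach)

# Sequential run with %do%; results are combined with c().
squares <- foreach(i = 1:4, .combine = c) %do% {
  i^2
}
squares   # 1 4 9 16
```

To scale out, the same loop body can be run with %dopar% once a backend has been registered, for example with doParallel::registerDoParallel(); which backend fits depends on the environment.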
RCPP, GMATRIX
Increasing Performance
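One common way to use Rcpp is to compile a small C++ function inline and call it from R. The sketch below assumes Rcpp and a working C++ toolchain are available; sum_sq is a hypothetical function name.

```r
library(Rcpp)

# Compile a small C++ function and expose it to R as sum_sq().
# sum_sq is a hypothetical name chosen for this example.
cppFunction('
double sum_sq(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); ++i) {
    total += x[i] * x[i];
  }
  return total;
}')

sum_sq(c(1, 2, 3))   # 14
```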
CARET (CLASSIFICATION AND REGRESSION TRAINING)
For Machine Learning
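caret wraps many model types behind one train() interface. A minimal sketch on the built-in iris data follows, assuming the caret and rpart packages are installed; the choice of model and resampling scheme is arbitrary.

```r
library(caret)

set.seed(42)

# 5-fold cross-validation on the built-in iris data.
ctrl <- trainControl(method = "cv", number = 5)

# method = "rpart" fits a decision tree and needs the rpart package;
# other model types plug into the same train() call.
fit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# Predicted classes for the first few rows.
head(predict(fit, newdata = iris))
```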
GGPLOT2
GGVIS
Visualization
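A small ggplot2 sketch using the built-in mtcars data, assuming ggplot2 is installed. The aesthetic choices are only for illustration.

```r
library(ggplot2)

# Scatter plot of the built-in mtcars data, one colour per cylinder count.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p)
```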
APPENDIX