More About R (and Why You Should Learn It!)
Dan Hartshorn
Twitter: @dfhartshorn
LinkedIn: https://www.linkedin.com/in/dhartshorn

CodeCamp2016.2 Intro to R


A bit about me
- I actually went to school for Management Science (Statistics & Data Management).
- I have been a developer in the SQL and web space for longer than I'm willing to admit.
- I was a data scientist before there was a name for it.
- I have been working with the tools for some time, and I'm glad we are finally getting around to making this stuff work.

Why learn any new language?
- It helps you get your work done quicker.
- It enables you to do things that you couldn't do any other way.
- Somebody pays you to do it.

Data Science
- In 2012 HBR called data scientist "the Sexiest Job of the 21st Century": https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/#
- Analytics have been rediscovered.
- Nate Silver said, "I think data-scientist is a sexed up term for a statistician.... Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn't berate the term statistician."
- It does NOT necessarily include Hadoop and Big Data.
- It is a set of tools and techniques to apply analytic rigor to data.

[Figure: diagram with the labels Connected data, relative expenditures, Cloud, Mobile, Complex implementations, Spreadmarts, Siloed data, Transactional systems, Enterprise data warehouse, OLAP, ETL, Hadoop, Interactive dashboards, Ad hoc analysis, Operational reporting, Machine learning, Any data, In-memory]

Three major trends converging
- Intelligence
- Cloud
- Data

© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
11/14/2016 11:47 AM

Microsoft Ignite 2016 © 2016 Microsoft Corporation. All rights reserved.

From data to decisions and actions
- Descriptive [Reports]: What happened?
- Diagnostic [Interactive Dashboards]: Why did it happen?
- Predictive [Machine Learning]: What will happen?
- Prescriptive [Recommendations & Automation]: What should I do?
[Diagram label: Insight]

Microsoft Data Insights Summit © 2016 Microsoft Corporation. All rights reserved.

R Adoption is Exploding
- #1 language for data science
- R usage growth (Rexer Data Miner Survey, 2007-2015)
- 76% of analytic professionals report using R
- 36% select R as their primary tool

R is Cool BUT
- Data flows overwhelm open source R
  - In-memory operation
  - Lack of parallelism
  - Expensive data movement & duplication
- Not enterprise ready
  - Inadequacy of community support
  - Lack of guaranteed support timeliness
  - No SLAs or support models
These shortcomings are addressed by corporate (Microsoft) support.

What is R?
- Language Platform
  - A domain specific language for statistics and numeric analysis
  - A data visualization tool
  - Open source
- Community
  - 2.5+M users
  - Taught in most universities
  - Thriving user groups worldwide
  - New and recent grads use it
- Ecosystem
  - 8000+ free algorithms in CRAN
  - Scalable to big data
  - Rich application & platform integration

Foundations
- It was written by and for statisticians (data scientists).
- It is interpreted.
- Data is managed in vectors, matrices, arrays, and data frames. (There are no scalar values, only vectors of length 1.)
- It uses packages to get things done.
- There are more than 7,000 packages to choose from.
- Packages are located in the Comprehensive R Archive Network, or CRAN.
- Used for analytics, graphing, machine learning, ...
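To make the data-structure point concrete, here is a minimal sketch in base R. The variable names (x, v, m, arr, df) and the sample values are made up for illustration; only the base functions themselves come with R.

```r
# A "scalar" in R is really a vector of length 1.
x <- 42
x_is_vector <- is.vector(x)
x_length <- length(x)

# Arithmetic is vectorized: it applies to every element at once.
v <- c(1, 2, 3)
doubled <- v * 2

# The other core containers mentioned on the slide.
m <- matrix(1:6, nrow = 2)             # 2 rows x 3 columns
arr <- array(1:24, dim = c(2, 3, 4))   # 3-dimensional array
df <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))

# Packages are installed from CRAN once, then loaded per session.
# install.packages("forecast")   # needs network access, so left commented out
library(stats)                   # ships with R, shown only to illustrate library()
```

Typing any of these names at the R prompt prints the object, which is a quick way to explore how each structure is laid out.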

New York Times, June 25, 2009 (3 hours after Michael Jackson's death)

- Credit risk analysis
- Financial networks

Credit Suisse: http://blog.revolutionanalytics.com/2013/05/sheftel-on-r-on-the-trading-desk.html
ANZ: http://blog.revolutionanalytics.com/2011/08/how-anz-uses-r-for-credit-risk-analysis.html
American Century: http://blog.revolutionanalytics.com/2013/06/american-century-investments.html

Facebook
- Exploratory data analysis
- Experimental analysis

"Generally, we use R to move fast when we get a new data set. With R, we don't need to develop custom tools or write a bunch of code. Instead, we can just go about cleaning and exploring the data."
Solomon Messing, data scientist at Facebook

Housing
- Crime mapping
- Statistical forecasting

The core innovation that Zillow offers are its advanced statistical predictive products, including the Zestimate, the Rent Zestimate and the ZHVI family of real estate indexes. By using R in production as well as research, Zillow maximizes flexibility and minimizes the latency in rolling out updates and new products.

Zillow: http://strataconf.com/stratany2012/public/schedule/detail/26345
Trulia: http://blog.revolutionanalytics.com/2011/06/the-residuals-of-crime.html

Microsoft Azure uses R for Reliability
- Capacity planning
  - Forecasting hardware purchase requirements (forecast package)
  - Also RAM requirements for Microsoft IT
- System monitoring & alerting
  - Understanding user behavior (how users configure monitoring platform)
  - Visualizing infrastructure utilization data
  - Abnormal login detection
  - Custom R packages to analyze monitoring data (time series anomaly detection)

- Microsoft R Open with Microsoft R Server: big-data analytics and distributed computing on Linux, Hadoop and Teradata
- SQL Server 2016: big-data analytics integrated with the SQL Server database
- Power BI: computations and charts from R scripts in dashboards
- Azure ML Studio: R scripts in cloud-based Experiment workflows
- Visual Studio: R Tools for Visual Studio, an integrated development environment for R (coming soon)
- HDInsight: R integrated with cloud-based Hadoop clusters
- Cortana Analytics: cloud-based R APIs and virtual machines

(Demo) Basic R Structure in the IDE

What is SHINY?
- Web server that makes R interactive and produces displays
- A library within R
- Commercial version by RStudio
- Creates HTML documents
- Can be styled with CSS and JavaScript
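As a rough illustration of the points above, a minimal Shiny app might look like the sketch below. It assumes the shiny package is installed from CRAN; the input name "bins", the output name "hist", and the titles are made up for this example, and faithful is a sample dataset that ships with R.

```r
library(shiny)

# UI: Shiny turns this R description into an HTML page.
ui <- fluidPage(
  titlePanel("Hypothetical histogram demo"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server: re-runs the plot whenever the slider value changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting,
         breaks = input$bins,
         main = "Old Faithful waiting times",
         xlab = "Minutes")
  })
}

# Build the app object; call runApp(app) in an interactive session to serve it.
app <- shinyApp(ui = ui, server = server)
```

Calling runApp(app) starts a local web server and opens the page in a browser, which is what makes the R code interactive.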

Some Advanced Uses of R
- New Zealand Tourism Dashboard
- NYT Bar Optimizer
- Alcohol Estimator

R is Cool BUT There are limitations
- Data flows overwhelm open source R
  - In-memory operation
  - Lack of parallelism
  - Expensive data movement & duplication
- Not enterprise ready
  - Inadequacy of community support
  - Lack of guaranteed support timeliness
  - No SLAs or support models
These shortcomings are addressed by corporate (Microsoft) support.

Introducing Microsoft R

R from Microsoft brings
- Peace of mind
- Efficiency
- Speed and scalability
- Flexibility and agility

Introducing Microsoft R Server

- High-performance, scalable R
- 100% open source R
- Compatible with CRAN, Bioconductor, MRAN, GitHub
- Massively scalable
- Multi-platform
- Big data connectivity
- Hybrid architecture capable
- Choice of IDE
- Linux, Windows, Hadoop & Teradata, and SQL Server 2016

[Diagram: R Server Technology. Open source components: CRAN, Microsoft R Open. Licensed components: DistributedR, ScaleR, ConnectR, DeployR. Also labeled: IDE.]

Microsoft R portfolio
- Commercial: R Server
- Community: R Open
[Diagram labels: SQL Server R Services, Red Hat, SUSE, Hadoop, Teradata, Windows]


Introducing SQL Server 2016 R Services
- Included in SQL Server 2016
- Reuse and optimize existing R code
- Eliminate data movement

- In-database deployment
- Memory and disk scalability
- No R memory limits
- Write once, deploy anywhere

- Enterprise speed and scale
- Near-DB analytics
- Parallel threading and processing
- Reuse SQL skills for data engineering

- Cost effectiveness
- Scalability and choice
- Simplicity and agility

SQL Server 2016 New Capabilities:
In-database execution of R, ScaleR & CRAN + SQL
- In-database execution of:
  - R scripts
  - T-SQL scripts containing R
- Move the work to the data
- Run R from the query processor
- Retrieve models, scores, transformed data


How Does Remote Execution Work?
Pack and ship requests to remote environments
- Load block at a time
- Analyze blocks in parallel
- Distribute work, compile results
The results:
- Even faster computation
- Larger data set capacity
- Fewer security concerns
- No data movement, no copies
[Diagram labels: Algorithm Master, Big Data, Predictive Algorithm, Work]
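The "load a block, analyze it, compile the results" idea can be sketched in plain base R. This is only an illustration of the pattern, not how ScaleR or R Server is implemented; the data, the block size, and all variable names are made up.

```r
set.seed(42)
x <- rnorm(1e5)   # stand-in for a data set too large to process in one piece

# "Load block at a time": split the data into fixed-size blocks.
block_size <- 1e4
blocks <- split(x, ceiling(seq_along(x) / block_size))

# "Analyze blocks": each block returns only a small summary (sum and count).
# lapply() runs sequentially here; the blocks are independent of each other.
partials <- lapply(blocks, function(b) c(sum = sum(b), n = length(b)))

# "Compile results": the master combines the per-block summaries.
totals <- Reduce(`+`, partials)
chunked_mean <- unname(totals["sum"] / totals["n"])
```

Because each block is summarized independently, the lapply() step is the part that could be handed to parallel workers, and only the small summaries need to travel back, not the data.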


Microsoft R Server delivers
- The industry's broadest R-based platform
- Enterprise scale atop Spark, Hadoop, RDBMSs & EDWs
- Freedom from memory limits
- Choice of Windows and Linux IDEs
- Stable deployment
- Write-once-deploy-anywhere portability
- Investment protection
- Hybrid cloud evolution

How do I get started in R?
- Read the really good piece from Computerworld: http://www.computerworld.com/article/2884322/app-development/learn-r-programming-basics-with-our-pdf.html?nsdr=true
- Load it up and get to work!
- Download Microsoft R Open: http://mran.microsoft.com
- Download RStudio (it's free): http://rstudio.com
- Download R Tools for Visual Studio: https://www.visualstudio.com/en-us/features/rtvs-vs.aspx
  - R-aware editor with IntelliSense
  - R Interactive Window with multi-line editing
  - Debugger with Locals and Stack views
  - Plots and Shiny apps
  - Variable Explorer, data frame viewer
  - Support for CRAN R, Microsoft R Open, Microsoft R Server
  - Free and open source

R Tools for Visual Studio
- Really new (about a month old, and it's still in preview)
- It is really cool and will let you do bunches of stuff that you can't do in other tools.
- Interactive Window
- IntelliSense
- Debugging
- History
- Enhanced plotting and data analysis
- It's free (if you have VS): https://www.visualstudio.com/en-us/features/rtvs-vs.aspx

Microsoft R Open
- Enhanced open source R distribution
- 100% compatible with all R-related software
- Faster performance with multi-threading
- CRAN Time Machine for reproducibility
- Available for Windows, Mac, and Linux
- Free and open source
- Download from mran.microsoft.com

MRO: Multi-threaded performance
- Intel MKL replaces standard BLAS/LAPACK algorithms
- Download and install MKL from MRAN
- Windows and Linux platforms
- High-performance algorithms
- Pipelined operations optimized for Intel
- Sequential Parallel
- Uses as many threads as there are available cores
- Control with: setMKLthreads()
- No need to change any R code
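A small sketch of what "no need to change any R code" means in practice: the matrix operation below is ordinary base R, and any speed-up comes from whichever BLAS library R is linked against. The guard assumes that setMKLthreads() lives in the RevoUtilsMath package bundled with Microsoft R Open; the matrix size and thread count are arbitrary choices for the example.

```r
set.seed(1)
n <- 500
a <- matrix(rnorm(n * n), nrow = n)

# Assumption: setMKLthreads() comes from RevoUtilsMath (Microsoft R Open only).
# On plain CRAN R the package is absent, so this step is skipped.
if (requireNamespace("RevoUtilsMath", quietly = TRUE)) {
  RevoUtilsMath::setMKLthreads(2)
}

# A BLAS-heavy operation: t(a) %*% a, computed by the linked BLAS library.
timing <- system.time(b <- crossprod(a))
elapsed <- timing[["elapsed"]]
```

Running the same script under CRAN R and under Microsoft R Open and comparing elapsed is a simple way to see the effect on your own machine.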

[Figure: benchmark chart comparing R and MRO. Benchmark details at MRAN.]

Resources
- Microsoft MRAN Getting Started: https://mran.microsoft.com/documents/getting-started/
- If you are moving from Excel: https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users
- 60+ Resources for R: http://www.computerworld.com/article/2497464/business-intelligence/business-intelligence-60-r-resources-to-improve-your-data-skills.html?nsdr=true

Questions?