45
No package? OK, develop it !!! Koji Makiyama Global Tokyo.R #2 2017/04/01 1

R Packages githubinstall magicfor dplyr.teradata

  • Upload
    hoxom

  • View
    603

  • Download
    1

Embed Size (px)

Citation preview

Page 1: R Packages githubinstall magicfor dplyr.teradata

No package? OK, develop it !!!

Koji Makiyama Global Tokyo.R #2

2017/04/01

1

Page 2: R Packages githubinstall magicfor dplyr.teradata

About Me

HOXO-M Inc.

President & CEO

2

Page 3: R Packages githubinstall magicfor dplyr.teradata

HOXO-M Inc.

•  Consists of –  Awesome young data scientists and –  R-Ojisan: people who too love R.

•  Our Stance: “No package? OK, develop it !!!” (なければ作る!それがホクソエムの誓い!)

3

Page 4: R Packages githubinstall magicfor dplyr.teradata

Developed Packages

•  CRAN githubinstall batade RODBCDBI densratio jpmesh magicfor

•  GitHub pforeach jaguchi RFinanceJ SparkRext easyRFM healthplanet dplyr.teradata rOpenWeatherMap etc.

4

Page 5: R Packages githubinstall magicfor dplyr.teradata

Today’s Story

•  Introduce Our Packages Picked up –  githubinstall –  magicfor –  dplyr.teradata

5

Page 6: R Packages githubinstall magicfor dplyr.teradata

githubinstall

6

Page 7: R Packages githubinstall magicfor dplyr.teradata

githubinstall

•  Install R Packages on GitHub without Developer Names

•  Using devtools >install_github("hadley/dplyr")

•  Using githubinstall >githubinstall("dplyr")

7

Page 8: R Packages githubinstall magicfor dplyr.teradata

Motivation

•  There is a great package ggfortify. •  In the past, it is not on CRAN. •  I used to install it using devtools

>install_github("sinhrks/ggfortify")

•  It is hard for me to remember the correct spelling of “sinhrks”

8

Page 9: R Packages githubinstall magicfor dplyr.teradata

Motivation

•  I wanted to install packages even if I forget who created them.

•  “No package? OK, develop it !!!”

9

Page 10: R Packages githubinstall magicfor dplyr.teradata

How Does It Work?

•  Gepuro Task Views http://rpkg.gepuro.net –  Crawling GitHub Repositories Every Day

•  Atsushi Hayakawa –  One of the Awesome –  HOXO-M

10

Page 11: R Packages githubinstall magicfor dplyr.teradata

Preparation

•  Install >install.packages("githubinstall")

•  Load Library>library(githubinstall)

11

Page 12: R Packages githubinstall magicfor dplyr.teradata

Basics

•  Install Packages Hosted on GitHub >githubinstall("package_name")

•  Example: twitter/AnomalyDetection >githubinstall("AnomalyDetection")

12

Page 13: R Packages githubinstall magicfor dplyr.teradata

Fuzzy Matching

•  Example: DiagrammeR

>githubinstall("DiagramR")Suggestion:-rich-iannone/DiagrammeRDoyouwanttoinstallthepackage?(Y/n)

13

Page 14: R Packages githubinstall magicfor dplyr.teradata

Specify Git References

•  From Branchesgithubinstall("ggplot2",ref="sf")

•  From Tags Githubinstall("ggplot2",ref="v1.1.0")

•  From Commits githubinstall("ggplot2",ref="f4398b")

14

Page 15: R Packages githubinstall magicfor dplyr.teradata

Suggest

•  To Know Repository Names w/o Install >gh_suggest("DiagramR")[1]"rich-iannone/DiagrammeR"

•  Fuzzy Search for Developer Names >gh_suggest_username("yuhui")[1]"yihui"

15

Page 16: R Packages githubinstall magicfor dplyr.teradata

List Packages•  List Packages by Developer Names •  Example: Packages Created by Hadley >hadleyverse<-gh_list_packages(username="hadley")>head(hadleyverse)usernamepackage_nametitle1hadleyRcppDateTime2hadleyS3HelpersforProgrammingwith3hadleyassertthatUserfriendlyassertionsfor4hadleybabynamesAnRpackagecontainallbab5hadleybenchBechmarkingtoolsfor6hadleybigrqueryAninterfacetoGoogle'sbig

16

Page 17: R Packages githubinstall magicfor dplyr.teradata

Search Packages•  Search Packages by Keywords •  Example: Search Packages Related Lasso >lasso_packages<-gh_search_packages("lasso")>head(lasso_packages)usernamepackage_nametitleCY-devsparseSVMSolutionPathsofSparChingChuan-Chenmilrmultiple-instancelogiFrankDfuserFusedlassoforhigh-dManuSettySeqGLSeqGLisagrouplassoPingYangChenmilrmultiple-instancelogiTaddyLabgamlrGammalassoregression

17

Page 18: R Packages githubinstall magicfor dplyr.teradata

Summary

18

•  The githubinstall package provides helper functions to install and find packages hosted on GitHub.

Page 19: R Packages githubinstall magicfor dplyr.teradata

magicfor

19

Page 20: R Packages githubinstall magicfor dplyr.teradata

magicfor•  Remember Printed Values in for Loops >magic_for(print)>for(iin1:3){+squared<-i^2+print(squared)+}>magic_result_as_vector()[1]149

20

Page 21: R Packages githubinstall magicfor dplyr.teradata

Motivation

•  Printed Values in for loops go away. >for(iin1:3){+squared<-i^2+print(squared)+} [1]1[1]4[1]9

21

Page 22: R Packages githubinstall magicfor dplyr.teradata

Motivation

•  To keep it, we need to change the code. >result<-vector("numeric",3)>for(iin1:3){+squared<-i^2+result[i]<-squared+}>result[1]149

22

Page 23: R Packages githubinstall magicfor dplyr.teradata

Motivation

•  Too much hassle to carefully do that to: –  Prepare some containers –  With the correct length and –  Add assignment statements.

•  I don’t want to do that.

•  “No package? OK, develop it !!!”

23

Page 24: R Packages githubinstall magicfor dplyr.teradata

magicfor

•  Insert one line spell magic_for() >magic_for(print)>for(iin1:3){+squared<-i^2+print(squared)+}

•  You can take the values out at later. >magic_result_as_vector()[1]149

24

Page 25: R Packages githubinstall magicfor dplyr.teradata

How Does It Work?

• Magic

25

Page 26: R Packages githubinstall magicfor dplyr.teradata

How Does It Work?

•  If you want really to know the magic, read “Advanced R.”

•  Then check out magicfor codes on GitHub.

https://github.com/hoxo-m/magicfor

26

Page 27: R Packages githubinstall magicfor dplyr.teradata

Preparation

•  Install >install.packages("magicfor")

•  Load Library>library(magicfor)

27

Page 28: R Packages githubinstall magicfor dplyr.teradata

Basics•  magic_for(func,progress,test,silent)

•  Arguments func: function to print values (e.g. print, cat) progress: whether to display a progress bar test: number of iteration for test silent: whether to suppress messages

28

Page 29: R Packages githubinstall magicfor dplyr.teradata

Choose Print Function

•  You can choose function to print values >magic_for(cat)>for(iin1:3){+squared<-i^2+cat(squared)+}>magic_result_as_vector()[1]149

•  Default print function is put().

29

Page 30: R Packages githubinstall magicfor dplyr.teradata

put()

•  put() displays values with high flexibility. >x<-2;y<-3>put(x)x:2>put(x,y)x:2,y:3>put(x,x^2,x^3)x:2,x^2:4,x^3:8>put(x,squared=x^2,cubed=x^3)x:2,squared:4,cubed:8

30

Page 31: R Packages githubinstall magicfor dplyr.teradata

magicfor with put()•  magicfor & put() are very compatible >magic_for()>for(iin1:3){+put(x=i,squared=i^2,cubed=i^3)+}>magic_result_as_dataframe(F)xsquaredcubed1111224833927

31

Page 32: R Packages githubinstall magicfor dplyr.teradata

Summary

•  The magicfor package provides a magic function to store values in for loops automatically.

32

Page 33: R Packages githubinstall magicfor dplyr.teradata

dplyr.teradata

33

Page 34: R Packages githubinstall magicfor dplyr.teradata

dplyr.teradata

•  Teradata Backend for dplyr >tera_db<-src_teradata("schema_name")>table<-tbl(tera_db,"table_name”)

>query<-count(table,gender)

>collect(query)gendern1♀1232♂456

34

Page 35: R Packages githubinstall magicfor dplyr.teradata

Motivation

•  I would like to extract data with dplyr verbs from Teradata.

•  I found teradata.dplyr package on GitHub but it is not maintained.

•  “No package? OK, develop it !!!”

35

Page 36: R Packages githubinstall magicfor dplyr.teradata

How Does it Work?

•  There is a way to add new SQL backends to dplyr.

>vignette("new-sql-backend")

36

Page 37: R Packages githubinstall magicfor dplyr.teradata

Preparation

•  Install Teradata ODBC Driver •  Install Package >library(githubinstall)>githubinstall("dplyr.teradata")

•  Load Library>library(dplyr.teradata)

37

Page 38: R Packages githubinstall magicfor dplyr.teradata

Usage•  Connect to Teradata >tera_db<-src_teradata("schema_name")

•  Create Table Object >table<-tbl(tera_db,"table_name")

•  Construct Query >query<-count(table,column_name)

•  Send Query >result<-collect(query)

38

Page 39: R Packages githubinstall magicfor dplyr.teradata

Construct Query

•  Construct Queries using dplyr Verbs >query<-table%>%+select(gender,reg_date)%>%+filter(reg_date=="2017-04-01")%>%+group_by(gender)%>%+summarise(count=n())

•  They are converted to SQL implicitly.

39

Page 40: R Packages githubinstall magicfor dplyr.teradata

Check SQL

•  Check Converted SQL >show_query(query)<SQL>SELECT"gender",count(*)AS"count"FROM(SELECT*FROM(SELECT"gender"AS"gender","reg_date"AS"reg_date"FROMtable_name)AS"nwkksckhfq"WHERE("reg_date"='2017-04-01')AS"vmmugivqkw”GROUPBY"gender"

40

Page 41: R Packages githubinstall magicfor dplyr.teradata

NOTE

•  Generated SQL are usually redundant. •  We can only pray that the compiler will

do well.

41

Page 42: R Packages githubinstall magicfor dplyr.teradata

TIPS

•  In most cases, a large amount of data is stored in Teradata.

•  We should take care some points when extracting data for saving time. –  Establish connection with schema –  Use filter() and select()–  Use summarise() whenever possible –  Check explain(query) before send

42

Page 43: R Packages githubinstall magicfor dplyr.teradata

Query Execution Plan >explain(query)<PLAN>1)First,welockstable.table_nameforreadonareservedRowHashtopreventglobaldeadlock.2)Next,welockstable.table_nameforread.3)Wedoanall-AMPsSUMsteptoaggregatefrom..5)Finally,wesendoutanENDTRANSACTIONsteptoallAMPsinvolvedinprocessingtherequest.->ThecontentsofSpool1aresentbacktotheuserastheresultofstatement1.Thetotalestimatedtimeis2.05seconds.

43

Page 44: R Packages githubinstall magicfor dplyr.teradata

Join Across Schema

•  When you need to join tables across schemas, you should establish connections without schemas.

>db<-src_teradata("")>table1<-tbl(db,"schema1.table1")>table2<-tbl(db,"schema2.table2")>left_join(table1,table2,by="id")

44

Page 45: R Packages githubinstall magicfor dplyr.teradata

Summary

•  The dplyr.teradata package provides a way to extract data using dplyr verbs from Teradata.

•  It is a beta version. •  We welcome your bug reports and

contributions. https://github.com/hoxo-m/dplyr.teradata

45