
Bluemix Hadoop Beginner’s Guide -- Part I


Page 1: Bluemix hadoop beginners Guide part I

© 2014 IBM Corporation

Bluemix Hadoop Beginner’s Guide -- Part I

Joseph Chang

Senior IT Specialist

IBM Cloud


• Ambari
• HDFS Explorer
• WebHDFS API
• Connect with R Console
• Machine Learning (lm, k-means)

Page 2: Bluemix hadoop beginners Guide part I

Reference:

https://www.ng.bluemix.net/docs/services/AnalyticsforHadoop/index.html#analyticsforhadoop_data

Take me to Bluemix: http://www.bluemix.net

Page 3: Bluemix hadoop beginners Guide part I

Are you the target reader?

Have you heard about Bluemix? If not, learn about Bluemix and sign up: http://www.bluemix.net

Do you know Hadoop? If not, learn about Hadoop: https://hadoop.apache.org/

Do you know the R language? If not, learn about R: https://www.r-project.org/

Are you interested in making the three work together? If yes, continue to the next page; if not, goodbye.

Page 4: Bluemix hadoop beginners Guide part I

The following two Bluemix services are used in this tutorial:

This tutorial assumes you already have a Bluemix ID. If you don't, go to http://www.bluemix.net to get one.

Page 5: Bluemix hadoop beginners Guide part I

Create Hadoop Service in Bluemix

Create a Java runtime and add a Hadoop service to it yourself.

Page 6: Bluemix hadoop beginners Guide part I

Create Hadoop Service in Bluemix

You can get the AmbariUrl, WebhdfsUrl, user ID, password, and other details from "Show Credentials".
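For the curl and Big R steps later in this guide it helps to keep those values in R variables. A minimal sketch with placeholder values (copy the real ones from your own "Show Credentials" panel):

# Placeholders for the values shown under "Show Credentials"
hadoop_host     <- "bi-hadoop-prod-<Cluster ID>.services.dal.bluemix.net"
hadoop_user     <- "biblumix"
hadoop_password <- "<your_biblumix_password>"
ambari_url      <- paste0("https://", hadoop_host, ":8081")
webhdfs_base    <- paste0("https://", hadoop_host, ":8443/gateway/default/webhdfs/v1")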

Page 7: Bluemix hadoop beginners Guide part I

Ambari Hadoop Management

Page 8: Bluemix hadoop beginners Guide part I

Monitoring Hadoop with Ambari

Launch the Ambari Dashboard with this URL:

https://bi-hadoop-prod-<Cluster ID>.services.dal.bluemix.net:8081

e.g. https://bi-hadoop-prod-2016.services.dal.bluemix.net:8081

Page 9: Bluemix hadoop beginners Guide part I

Ambari – View the detailed information of each service

Note: the Spark service is available in this environment.

Page 10: Bluemix hadoop beginners Guide part I

Ambari – Hosts

The server nodes in this Hadoop cluster.

Page 11: Bluemix hadoop beginners Guide part I

Ambari – Cluster Stack Version

The Big R service will be used in this tutorial.

Page 12: Bluemix hadoop beginners Guide part I

HDFS Explorer

Page 13: Bluemix hadoop beginners Guide part I

HDFS Explorer

Launch the HDFS Explorer with this URL:

https://bi-hadoop-prod-<Cluster ID>.services.dal.bluemix.net:8443/gateway/default/hdfs/explorer.html

e.g. https://bi-hadoop-prod-2016.services.dal.bluemix.net:8443/gateway/default/hdfs/explorer.html

View the files on the Hadoop file system. It is read-only.

Page 14: Bluemix hadoop beginners Guide part I

HDFS – Healthy

Page 15: Bluemix hadoop beginners Guide part I

WebHDFS REST API

Page 16: Bluemix hadoop beginners Guide part I

Upload Data with curl + WebHDFS REST API

Step 1: create the file and get the redirect location:

curl -i -L -k -s --user biblumix:<your_biblumix_password> --max-time 45 -X PUT "https://bi-hadoop-prod-<your_cluster_number>.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1/user/biblumix/<path_to_file/file_name>?op=CREATE"

Step 2: upload the local file to that location:

curl -i -L -k -s --user biblumix:<your_biblumix_password> --max-time 45 -X PUT -T <file_name.txt> "<Location URL from step 1 response message>"

Use the WebHDFS API to upload a file. If you can't run "curl" from your command line, search for it online and download it.

The current CREATE API has a defect that causes the uploaded file size to be 0; the two-step approach above is the workaround.
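If you prefer to drive WebHDFS from R instead of the command line, the same two-step upload can be sketched with the httr package. This is only a rough equivalent of the curl calls above, not part of Big R; the host, path, and file name are placeholders:

library(httr)

webhdfs_base <- "https://bi-hadoop-prod-<your_cluster_number>.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1"

# Step 1: CREATE returns a 307 redirect; capture the Location header instead of following it
r1 <- PUT(paste0(webhdfs_base, "/user/biblumix/mydir/mydata.csv?op=CREATE"),
          authenticate("biblumix", "<your_biblumix_password>"),
          config(followlocation = FALSE, ssl_verifypeer = FALSE))
location <- headers(r1)[["location"]]

# Step 2: upload the local file to that location; expect HTTP 201
r2 <- PUT(location,
          authenticate("biblumix", "<your_biblumix_password>"),
          body = upload_file("mydata.csv"),
          config(ssl_verifypeer = FALSE))
status_code(r2)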

Page 17: Bluemix hadoop beginners Guide part I

Upload Data with curl + WebHDFS REST API (Screen capture)

Step 1: CREATE (temporary redirect). You should get response code 307; the Location header in the step 1 response is used in step 2.

Step 2: upload the file from local disk. You should get response code 201.

Page 18: Bluemix hadoop beginners Guide part I

Upload Data with curl + WebHDFS REST API (Result)

The file has been uploaded. Note that the size should not be 0.

Page 19: Bluemix hadoop beginners Guide part I

More WebHDFS REST API examples

List a directory (LISTSTATUS):
curl -i -k -s --user biblumix:<your_biblumix_password> --max-time 45 "https://bi-hadoop-prod-<your_cluster_number>.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1/user?op=LISTSTATUS"

List Oozie workflow jobs:
curl -i -s --user biblumix:<password> "https://<hostname>:8443/gateway/default/oozie/v1/jobs?jobtype=wf"

Submit and start an Oozie job (configuration in oozie-mrjob-config.xml):
curl -i -s --user biblumix:<password> -X POST -H "Content-Type: application/xml" -d @oozie-mrjob-config.xml "https://<hostname>:8443/gateway/default/oozie/v1/jobs?action=start"

Delete a file (DELETE):
curl -i -s --user biblumix:<your_biblumix_password> --max-time 45 -X DELETE "https://bi-hadoop-prod-<your_cluster_number>.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1/user/biblumix/<path_to_file>?op=DELETE"

Create a directory (MKDIRS):
curl -i -k -s --user biblumix:<your_biblumix_password> --max-time 45 -X PUT "https://bi-hadoop-prod-<your_cluster_number>.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1/user/biblumix/<path_to_directory>?op=MKDIRS"
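The same pattern works from R as well. A hedged httr/jsonlite sketch of the LISTSTATUS request above (host and credentials are placeholders):

library(httr)
library(jsonlite)

webhdfs_base <- "https://bi-hadoop-prod-<your_cluster_number>.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1"

# List /user and show the name, type and size of each entry
resp <- GET(paste0(webhdfs_base, "/user?op=LISTSTATUS"),
            authenticate("biblumix", "<your_biblumix_password>"),
            config(ssl_verifypeer = FALSE))
listing <- fromJSON(content(resp, as = "text"))
listing$FileStatuses$FileStatus[, c("pathSuffix", "type", "length")]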

Page 20: Bluemix hadoop beginners Guide part I

Install R Console & Big R

Page 21: Bluemix hadoop beginners Guide part I

Download Drivers for Big R

https://hub.jazz.net/project/kulkarni/a4h/overview#https://hub.jazz.net/git/kulkarni%252Fa4h/list/master/client-libs

The Big R library can be downloaded from this URL. Extract the file to /temp.

Page 22: Bluemix hadoop beginners Guide part I

Install R Console

Download the R language: https://cran.r-project.org/

If you don't have the R console on your PC/notebook, download it from this URL. You can use either the R console or a terminal: to launch R from a terminal, type R on the command line.

Page 23: Bluemix hadoop beginners Guide part I


Install Big R

> install.packages('rJava')
--- Please select a CRAN mirror for use in this session ---

HTTPS CRAN mirror

1: 0-Cloud [https] 2: Austria [https]

3: Chile [https] 4: China (Beijing 4) [https]

5: China (Hefei) [https] 6: Colombia (Cali) [https]

7: France (Lyon 2) [https] 8: Germany (Münster) [https]

9: Iceland [https] 10: Russia (Moscow) [https]

11: Spain (A Coruña) [https] 12: Switzerland [https]

13: UK (Bristol) [https] 14: UK (Cambridge) [https]

15: USA (CA 1) [https] 16: USA (KS) [https]

17: USA (MI 1) [https] 18: USA (TN) [https]

19: USA (TX) [https] 20: USA (WA) [https]

21: (HTTP mirrors)

Selection: 1

trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/rJava_0.9-7.tgz'
Content type 'application/x-gzip' length 604271 bytes (590 KB)
==================================================
downloaded 590 KB

The downloaded binary packages are in

/var/folders/g0/jgl74nkx0h97dgpywqv2prrc0000gn/T//Rtmp3ggvb8/downloaded_packages


Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
  dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
  Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
  Reason: image not found
>

Before installing the Big R package, we need to install rJava.

Page 24: Bluemix hadoop beginners Guide part I

Install Big R

> install.packages('base64enc')
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/base64enc_0.1-3.tgz'
Content type 'application/x-gzip' length 26679 bytes (26 KB)
==================================================
downloaded 26 KB

The downloaded binary packages are in
/var/folders/g0/jgl74nkx0h97dgpywqv2prrc0000gn/T//Rtmp3ggvb8/downloaded_packages

> install.packages('data.table')
also installing the dependencies ‘stringi’, ‘magrittr’, ‘plyr’, ‘stringr’, ‘Rcpp’, ‘chron’, ‘reshape2’

trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/stringi_0.5-5.tgz'
Content type 'application/x-gzip' length 12685069 bytes (12.1 MB)
==================================================
downloaded 12.1 MB

...

trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/data.table_1.9.4.tgz'
Content type 'application/x-gzip' length 1266610 bytes (1.2 MB)
==================================================
downloaded 1.2 MB

The downloaded binary packages are in
/var/folders/g0/jgl74nkx0h97dgpywqv2prrc0000gn/T//Rtmp3ggvb8/downloaded_packages

>

Before installing the Big R package, we also need to install base64enc and data.table.
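If you are starting from a clean R installation, the three prerequisites can also be installed in a single call; this is just ordinary install.packages usage, assuming a CRAN mirror is reachable:

# Install rJava, base64enc and data.table (plus their dependencies) in one step
install.packages(c('rJava', 'base64enc', 'data.table'))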

Page 25: Bluemix hadoop beginners Guide part I

Install Big R

> install.packages(pkg="/temp/bigr_3.18.tar.gz", type="source", repos=NULL)

* installing *source* package ‘bigr’ ...
** R
** inst
** preparing package for lazy loading
Attaching...
Creating a generic function for ‘toString’ from package ‘base’ in package ‘bigr’
Creating a generic function for ‘nchar’ from package ‘base’ in package ‘bigr’
Creating a generic function for ‘coef’ from package ‘stats’ in package ‘bigr’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (bigr)
>

Now you can install the Big R library. Make sure the path to the downloaded package file is correct.

Page 26: Bluemix hadoop beginners Guide part I

Install Big R (two issues in the Bluemix doc)

If you copy the install command from the Bluemix doc, you may get this error (as of Aug. 2015): the function name should use "packages", i.e. install.packages.

The Bluemix instructions also don't mention that the three libraries above (rJava, base64enc, data.table) need to be installed first.

Page 27: Bluemix hadoop beginners Guide part I

Machine Learning
-- Linear Regression
-- K-means

Reference:

http://www-01.ibm.com/support/knowledgecenter/SSPT3X_4.0.0/com.ibm.swg.im.infosphere.biginsights.bigr.doc/doc/intro.html?cp=SSPT3X_4.0.0%2F9-1

Recommendation: learn more about Big R from this URL.

Page 28: Bluemix hadoop beginners Guide part I

Machine Learning – Big R Example #1-1

#############################
# 1.1 Connect to Bluemix Hadoop
#############################

# In order to try out any example, first run the following steps to upload
# the aforementioned dataset to a BigInsights cluster.
library(bigr)

bigr.connect(host="bi-hadoop-prod-2016.services.dal.bluemix.net",
             user="biblumix", password="w9@4f0~HnXLD",
             ssl=TRUE, trustStorePath="/Library/Java/Home/lib/security/cacerts",
             trustStorePassword="changeit", keyManager="SunX509")

is.bigr.connected()

Replace the host with your own cluster ID, the password with your own password, and the trustStorePath with the Java home path in your environment.

Page 29: Bluemix hadoop beginners Guide part I

Machine Learning – Big R Example #1-2

##################
# 1.2 Data loading
##################

airfile <- system.file("extdata", "airline.zip", package="bigr")
airfile <- unzip(airfile, exdir = tempdir())
airR <- read.csv(airfile, stringsAsFactors=F)

# Upload the data to the BigInsights server. This may take 15-20 seconds
air <- as.bigr.frame(airR)
air <- bigr.persist(air, dataSource="DEL", dataPath="/user/bigr/examples/airline_demo.csv",
                    header=T, delimiter=",", useMapReduce=F)

dataSource="DEL" means the file is DELimited, here with "," as the delimiter.

Page 30: Bluemix hadoop beginners Guide part I

Big R Example #1 (Screen capture)

You should get "TRUE" if you successfully connect to Bluemix Hadoop. You can check whether the file was uploaded successfully with the HDFS Explorer.

Page 31: Bluemix hadoop beginners Guide part I

About the airline.csv sample data

The airline.zip sample ships with the bigr package; system.file("extdata", "airline.zip", package="bigr") returns its location inside your R library.
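To see exactly where the sample lives on your machine and what it contains, you can peek at it with base R; a small sketch, assuming only that the bigr package is installed:

# Locate and unpack the sample that ships with the bigr package
airfile <- system.file("extdata", "airline.zip", package = "bigr")
airfile <- unzip(airfile, exdir = tempdir())
airR    <- read.csv(airfile, stringsAsFactors = FALSE)
dim(airR)   # rows and columns of the sample
str(airR)   # column names and types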

Page 32: Bluemix hadoop beginners Guide part I

Machine Learning – Big R Example #2

###########################
# 2. Accessing data on HDFS
###########################

# Once uploaded, one merely needs to instantiate a bigr.frame object,
# commonly referenced as "air" in the examples, to access the dataset via
# the Big R API.
air <- bigr.frame(dataPath = "/user/bigr/examples/airline_demo.csv",
                  dataSource = "DEL", delimiter=",", header = T,
                  coltypes = ifelse(1:29 %in% c(9,11,17,18,23), "character", "integer"),
                  useMapReduce = F)

There are 29 columns in the airline_demo.csv file. Columns 9, 11, 17, 18 and 23 are character; the remaining columns are integer.
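The coltypes argument is plain base R, so you can evaluate it on its own to see the 29 type names it produces:

# character for columns 9, 11, 17, 18 and 23; integer for the rest
coltypes <- ifelse(1:29 %in% c(9, 11, 17, 18, 23), "character", "integer")
table(coltypes)   # 5 character columns, 24 integer columns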

Page 33: Bluemix hadoop beginners Guide part I

Big R Example #2 (Screen capture)

Page 34: Bluemix hadoop beginners Guide part I

Machine Learning – Big R Example #3-1

#################################################################
# 3. Machine Learning example: building a Linear Regression model
#################################################################

# Remove files from previous executions (if any)
invisible(bigr.rmfs("/user/bigr/examples/airline.sample.* /user/bigr/examples/lm.airline*"))

# Project some relevant columns for modeling / statistical analysis
airlineFiltered <- air[, c("Month", "DayofMonth", "DayOfWeek", "CRSDepTime", "Distance", "ArrDelay")]

# Create a bigr.matrix from the data
airlineMatrix <- bigr.transform(airlineFiltered,
                                outData="/user/bigr/examples/airline.sample.matrix",
                                transformPath="/user/bigr/examples/airline.sample.transform")

These six variables are chosen for this model.

Page 35: Bluemix hadoop beginners Guide part I

Machine Learning – Big R Example #3-2

#################################################################
# 3. Machine Learning example: building a Linear Regression model
#################################################################

# Split the data into 70% for training and 30% for testing
samples <- bigr.sample(airlineMatrix, perc=c(0.7, 0.3))
train <- samples[[1]]
test <- samples[[2]]

# Create a linear regression model
lm <- bigr.lm(ArrDelay ~ ., data=train, directory="/user/bigr/examples/lm.airline")

# Get the coefficients of the regression
coef(lm)

We will use "Month", "DayofMonth", "DayOfWeek", "CRSDepTime" and "Distance" to predict ArrDelay; the formula ArrDelay ~ . means ArrDelay as a function of all the other columns.

Page 36: Bluemix hadoop beginners Guide part I

Big R Example #3 (Screen Capture)

Page 37: Bluemix hadoop beginners Guide part I

Big R Example #3-2 (Result)

The arrival delay prediction model is:

Y: ArrDelay, X1: Month, X2: DayofMonth, X3: DayOfWeek, X4: CRSDepTime, X5: Distance

Y = -0.174423*X1 - 0.01547941*X2 - 0.03378236*X3 + 0.006222544*X4 + 0.0003556919*X5
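As a quick sanity check you can plug a hypothetical flight into these coefficients by hand. This is only an illustration: the flight values below are made up, and any intercept term is ignored because the slide does not show one.

# Coefficients from the fitted model, as shown above
b <- c(Month = -0.174423, DayofMonth = -0.01547941, DayOfWeek = -0.03378236,
       CRSDepTime = 0.006222544, Distance = 0.0003556919)

# A hypothetical flight: June 15th, a Wednesday, 09:00 scheduled departure, 500 miles
x <- c(Month = 6, DayofMonth = 15, DayOfWeek = 3, CRSDepTime = 900, Distance = 500)

sum(b * x)   # predicted arrival delay in minutes, ignoring the intercept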

Page 38: Bluemix hadoop beginners Guide part I

Machine Learning – Big R Example #3-3

#################################################################
# 3. Machine Learning example: building a Linear Regression model
#################################################################

# Calculate predictions for the testing set
pred <- predict(lm, test, "/user/bigr/examples/lm.airline.preds")

Page 39: Bluemix hadoop beginners Guide part I

Big R Example #3 (Screen Capture)

Predicted arrival delay time for test data.

Page 40: Bluemix hadoop beginners Guide part I

Big R Example #3 (Output)

View the prediction files generated on HDFS.

Page 41: Bluemix hadoop beginners Guide part I

Machine Learning – Big R Example #4

##################################################################
# 4. Machine Learning example: building a k-means clustering model
##################################################################

# Remove files from previous executions (if any)
invisible(bigr.rmfs("/user/bigr/examples/iris.* /user/bigr/examples/km*"))

# Load the Iris dataset to HDFS
irisbf <- as.bigr.frame(iris[, -5])

# Convert the Iris dataset into a bigr.matrix object
irisBM <- bigr.transform(bf = irisbf,
                         outData = "/user/bigr/examples/iris.mtx",
                         transformPath = "/user/bigr/examples/iris.transform")

# Create a k-means model with 10 clusters
km <- bigr.kmeans(irisBM, centers=10, directory="/user/bigr/examples/km", writeY=T)

# Use the existing model to cluster a different dataset
p <- predict(km, irisBM, "/user/bigr/examples/km.preds")

iris is a built-in sample data set in the R language; a small in-memory comparison follows below.
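For comparison, the same clustering can be run in plain, in-memory R. This is a base-R sketch, not the Big R API, and the cluster labels will not match the Big R run exactly because k-means starts from random centers:

# In-memory k-means on the same four numeric iris columns
set.seed(42)                            # fix the random start for reproducibility
km_local <- kmeans(iris[, -5], centers = 10)
table(km_local$cluster)                 # how many flowers land in each of the 10 clusters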

Page 42: Bluemix hadoop beginners Guide part I

About the sample data -- IRIS
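The original slide shows the data set itself; you can inspect it directly in any R session with base R:

# 150 observations: 4 numeric measurements plus the Species label
str(iris)
head(iris)
table(iris$Species)   # 50 each of setosa, versicolor and virginica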

Page 43: Bluemix hadoop beginners Guide part I

Big R Example #4 (Screen Capture)

The 10 clusters of the iris data produced by k-means.

Page 44: Bluemix hadoop beginners Guide part I

Big R Example #4 (Screen capture)

Assign each sample to a cluster with the model.

Page 45: Bluemix hadoop beginners Guide part I

Thank you

Take me to Bluemix: http://www.bluemix.net

Page 46: Bluemix hadoop beginners Guide part I

Appendix 1: Hadoop Cloud Demo with IBM Bluemix

I found this great video on YouTube. You can learn more about Bluemix Hadoop from it.

Page 47: Bluemix hadoop beginners Guide part I

Big Data Hadoop Cloud Demo – IBM Bluemix
https://www.youtube.com/watch?v=FUDOsBDAahE

Page 48: Bluemix hadoop beginners Guide part I

Appendix 2: Define Hadoop Cluster by yourself

If you want your application to run faster, you may choose this paid service, which runs on bare-metal servers with multiple nodes.

Page 49: Bluemix hadoop beginners Guide part I

BigInsights for Hadoop Cluster Topology

Page 50: Bluemix hadoop beginners Guide part I

BigInsights for Hadoop Cluster Topology