Final Report - Spark

5/25/2015

By Danyal, Baqir and Shoaib

IBA-FCS | INTRODUCTORY DATA ANALYTICS WITH APACHE SPARK


Table of Contents

Setting Up Network of Oracle VirtualBox
Setup Spark on Windows 7 Standalone Mode
Setup Spark on Linux
Spark Features
Spark Standalone Cluster Mode
Setup Master Node
Starting Cluster Manually
Starting & Connecting Worker to Master
Submitting Applications to Spark
Interactive Analysis with the Spark Shell
Spark Submit
Custom Code Execution & HDFS File Reading


Setting up network of Oracle VirtualBox:

1. When starting the VM, set the "Attached to" value to "Bridged Adapter", so that it can connect to other VMs on the network, including VMs that reside on different host machines.

2. Refresh the MAC address by clicking the refresh icon, so that every VM has a different MAC address and is assigned a unique IP.

Figure 1: VM setting to follow when creating the network

3. Once the VM starts with the above settings, make sure you are able to PING from the host to the VM and vice versa.


Setup SPARK on Windows 7 Standalone Mode:

Prerequisites:

Java 6+

Scala 2.10

Python 2.6+

Spark 1.2.x

sbt (in case you build Spark from source code)

Git (if you use the sbt tool)

Environment Variables:

Set JAVA_HOME and the PATH variable as environment variables.

Download and install Scala 2.10.

Set SCALA_HOME and add %SCALA_HOME%\bin to the PATH variable in the environment variables. To test whether Scala is installed, run the command "scala -version" at the command prompt.

Downloading & Setting up Spark:

Choose a Spark pre-built package for Hadoop, e.g. "Pre-built for Hadoop 2.3/2.4 or later". Download and extract it to any drive, e.g. D:\spark-1.2.1-bin-hadoop2.3.

Set SPARK_HOME and add %SPARK_HOME%\bin to PATH in the environment variables.

Download winutils.exe and place it in any location (e.g. D:\winutils\bin\winutils.exe) to avoid Hadoop errors.

Set HADOOP_HOME = D:\winutils in the environment variables.

Now re-run the command "spark-shell" and you'll see the Scala shell.

For the Spark UI, open http://localhost:4040/ in a browser.

Press Ctrl+Z to get out of the shell when it has executed successfully.

To test that the setup was successful, you can run a small example:
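For instance, a minimal sanity check in the PySpark shell (started with bin\pyspark) could look like the following; this is an illustrative snippet, not the exact program used in the original report:

# Minimal sanity check, run inside the PySpark shell (bin\pyspark).
# 'sc' is the SparkContext that the shell creates automatically.
rdd = sc.parallelize(range(100))   # distribute the numbers 0..99
print(rdd.sum())                   # should print 4950 if Spark is working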


If all goes well, the sample program will execute and return its result on the console. And that is how you set up Spark on Windows 7.

Note that the master web UI is served over HTTP (by default at http://<master-IP>:8080), while spark://<master-IP>:7077 is the master URL used to connect workers.

Setup SPARK on Linux:

Download the Hortonworks Sandbox, with Spark already installed and configured, from the following link:

http://hortonworks.com/hdp/downloads/

Choose HDP 2.2.4 on Sandbox.

You can find Spark in this directory:

/usr/hdp/2.2.4.2-2/spark/bin

You are now all ready to go, as Hortonworks has this pre-packaged for you.
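Once the sandbox is running, you can confirm that the bundled Spark works by starting a PySpark shell from that directory and checking a couple of attributes (an illustrative check):

# From /usr/hdp/2.2.4.2-2/spark, start the shell with ./bin/pyspark, then:
print(sc.version)   # prints the bundled Spark version
print(sc.master)    # shows which master this shell is connected to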

SPARK Features:

Speed - Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

Ease of Use - Write applications quickly in Java, Scala or Python. Spark offers 80 high-level operators that make it easy to build parallel apps (see the short example after this list).

Generality - Combines SQL, streaming and complex analytics. Spark powers its stack with high-level tools including Spark SQL, MLlib (for machine learning), GraphX and Spark Streaming.


Runs Everywhere - Spark runs on Hadoop, in standalone mode, or in the cloud, and can access diverse data sources including HDFS, Cassandra, HBase and S3.
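As a brief illustration of these high-level operators, a parallel word count can be written in a few lines of PySpark. This is a hypothetical snippet added for illustration, and README.md is just an example input file:

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

# Three high-level operators: flatMap splits lines into words,
# map pairs each word with a count of 1, and reduceByKey sums
# the counts for each word in parallel across the cluster.
counts = (sc.textFile("README.md")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(5))   # show a few (word, count) pairs
sc.stop()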

SPARK Standalone Cluster Mode:

There are two other Spark deployment modes, YARN and Mesos, but here we will only talk about standalone mode.

Spark is already installed in standalone mode on the Hortonworks VM node. Simply acquire a pre-built version of Spark for any future implementations.

Setup Master Node

1. Edit the /etc/hosts file with the vi editor.

2. Edit the file so that it lists the master and both slaves (the slaves must be set up the same way as this VM, and the same entries must be made in the /etc/hosts file on the slaves too); sample entries are shown after this list.


3. Change the hostname of the master machine to master and the hostnames of the slave machines to slave1 and slave2, respectively.

4. There is one more step to perform: create a file called conf/slaves in the Spark directory on the master, which must contain the hostnames of all the machines where you intend to start Spark workers, one per line. If conf/slaves does not exist, the launch scripts default to a single machine (localhost), which is useful only for testing.

5. If everything has gone according to plan, your multi-node Spark cluster over the network is up and running; you can verify this by pinging the other machines from every machine. In our case there were three machines (1 master, 2 slaves).
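For illustration, the two files described above might look like the following. The hostnames match the ones used in this report, but the IP addresses are hypothetical placeholders; substitute the actual addresses of your machines.

# /etc/hosts (same entries on the master and on every slave; IPs are placeholders)
192.168.1.10   master
192.168.1.11   slave1
192.168.1.12   slave2

# conf/slaves (in the Spark directory on the master; one worker hostname per line)
slave1
slave2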

Starting Cluster Manually

You can start a standalone master server by executing the following:

./sbin/start-master.sh

Once started, the master will print out its master URL, i.e. spark://HOSTNAME:PORT, which is used to connect workers to it. This URL can also be found on the master web UI, whose default address is http://localhost:8080.

sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.

sbin/start-all.sh - Starts both a master and a number of slaves as described above.

sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.

sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.

sbin/stop-all.sh - Stops both the master and the slaves as described above.


Starting & Connecting Worker to Master

You can start worker(s) and connect them to the master via this command:

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://HOST:PORT

Once you have started a worker, look at the master's web UI (http://localhost:8080 by default). You should see the new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).

Once the workers are connected to the master successfully (check this by browsing the web UI, where the slave machines will be listed, as in the figure below), any task you submit on the master will be distributed across all the available machines in the network and processed in parallel.

Figure 2: Master Web UI
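Once the workers appear in the master's web UI, you can also verify the cluster from a small PySpark job pointed at the master URL. This is an illustrative check, with the hostname and port standing in for your own master URL:

from pyspark import SparkContext

# Connect to the standalone master (placeholder URL).
sc = SparkContext(master="spark://master:7077", appName="ClusterCheck")

# A trivial distributed job: if it completes, the workers are reachable.
print(sc.parallelize(range(1000), 4).count())   # expect 1000
sc.stop()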


Submitting Applications to Spark

There are two ways to do this: an interactive shell, or spark-submit.

Interactive Analysis with the Spark Shell

PySpark:

./bin/pyspark --master spark://IP:port

Spark shell:

./bin/spark-shell --master spark://IP:port

There are many parameters that can be passed with the above commands; links to the documentation are given at the end of this document. Running the pyspark or spark-shell command opens an interactive shell in which you write code line by line, pressing Enter after each line, for example as below:

textFile = sc.textFile("README.md")

textFile.count() # Number of items in this RDD

textFile.first() # First item in this RDD
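A transformation can then be chained onto the same RDD in the next lines of the session; the snippet below is an illustrative continuation in the style of the Spark quick start, not part of the original report:

# Keep only the lines that mention "Spark" (a lazy transformation),
# then trigger the computation with an action.
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()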

In these snippets, 'sc' is the SparkContext object, which Spark makes available automatically when you run the pyspark or spark-shell command. Behind the scenes, spark-shell invokes the more general spark-submit script.

Spark Submit

Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and supports the different cluster managers and deploy modes that Spark offers.

As a simple example, from inside the Spark folder, execute something like this:

./bin/spark-submit --master spark://master:7077 k-means.py

Here --master specifies the cluster to submit the application to, followed by the name of the file to run, whether Scala, Java or Python.


Template for the spark-submit command:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Some of the commonly used options are:

--class: the entry point for your application (e.g. org.apache.spark.examples.SparkPi)

--master: the master URL for the cluster (e.g. spark://<hostname or IP>:7077)

--deploy-mode: whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client); the default is client

--conf: an arbitrary Spark configuration property in key=value format; for values that contain spaces, wrap "key=value" in quotes

application-jar: path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes. For a Python application, simply pass the .py file in this position.

application-arguments: arguments passed to the main method of your main class, if any
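As an illustration of submitting a Python application, a minimal script (a hypothetical simple_app.py, not the k-means.py used elsewhere in this report) might look like this:

# simple_app.py - a minimal PySpark application (illustrative example)
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="SimpleApp")
    # Count how many numbers in 0..999 are even.
    evens = sc.parallelize(range(1000)).filter(lambda n: n % 2 == 0).count()
    print("Even numbers: %d" % evens)
    sc.stop()

It would then be submitted exactly like the k-means example above, e.g. ./bin/spark-submit --master spark://master:7077 simple_app.py.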

Custom code execution & HDFS File reading

To read or access a text file inside your Python or Scala programs, the file must reside inside HDFS; only then will sc.textFile("file.txt") be able to read it and access its content. See the sample commands below for creating an HDFS directory and putting a file into it.

hadoop fs -mkdir hdfs://master/user/hadoop/spark/data/

creates a directory inside HDFS at the given path URI

hadoop fs -ls hdfs://master/user/hadoop/spark/data/

lists the directory contents inside HDFS at the given path URI

hadoop fs -put home/sample.txt hdfs://master/user/hadoop/spark/data/

uploads a file onto HDFS from a local Linux directory (home/ in this case)


hadoop fs -get hdfs://master/user/hadoop/spark/data/sample.txt home/

downloads a file from the HDFS directory to a local Linux directory (home/ in this case)

Sample code is given below, which reads a file from an HDFS directory:

from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt
from pyspark import SparkContext
import time

start_time = time.time()

sc = SparkContext(appName="K means")

# Load and parse the data
data = sc.textFile("hdfs://master/user/hadoop/spark/data/kmeans.csv")
header = data.first()
parsedData = data.filter(lambda x: x != header) \
                 .map(lambda line: array([float(x) for x in line.split(',')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=100,
                        runs=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
print("--- %s seconds ---" % (time.time() - start_time))

Aggregate Function Benchmarking

Query                                                                 Windows     Spark Cluster
0.8M records - Total amount spent on all transactions per customer    30 secs     7 secs
10M records - Transaction count                                       167 secs    8 secs
10M records - Total amount sum                                        106 secs    10 secs


K-Means Clustering Benchmarking

Cluster config                            Windows     Spark Cluster
K=4, iter=100, Python, rows=1048576       82 secs     25 secs
K=4, iter=1000, Python, rows=1048576      800 secs    31 secs
K=4, iter=1000, Scala, rows=1048576       -           13 secs

References:

1. https://spark.apache.org/docs/1.3.0/index.html

2. http://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/

3. https://spark.apache.org/docs/1.3.0/spark-standalone.html

4. https://spark.apache.org/docs/1.3.0/quick-start.html

5. https://spark.apache.org/docs/1.3.0/submitting-applications.html