Installation and Setup of Spark

DIPENDRA KUSI https://www.linkedin.com/in/er-dipendra-kusi-b3674193

2/11/17 SPARK SETUP

Step 1: First, set up Cloudera.

Step 2: Open a terminal in Cloudera and start the Spark shell:

/usr/bin/spark-shell

Step 3: Once the Spark shell has started, we can write Scala commands and run them in Spark through the Spark context (available in the shell as sc).

Now read the file from HDFS. Here the input file is already in HDFS:

val dt = sc.textFile("/user/cloudera/project_data/input")

We can put a file into HDFS using:

hadoop fs -put file0 /user/cloudera/project_data/input


Step 4: Now split the text content on whitespace and count the occurrences of each word:

val wordcount = dt.flatMap(x => x.split(" ")).map(x => (x, 1)).reduceByKey((a, b) => a + b)

Step 5: Now print the result:

for(value <- wordcount) {println(value)}
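
To see what the Step 4 pipeline computes without starting Spark, the same flatMap / map / reduce-by-key shape can be sketched on ordinary Scala collections. This is only an illustration: the object name WordCountSketch and the helper countWords are hypothetical and not part of the original slides, and groupBy followed by a sum stands in for Spark's reduceByKey.

```scala
// Hypothetical helper, not part of the original slides: mirrors the
// Spark word count on plain Scala collections (no SparkContext needed).
object WordCountSketch {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // same split as in Step 4
      .map(word => (word, 1))             // pair each word with the count 1
      .groupBy { case (word, _) => word } // stand-in for reduceByKey's grouping
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val counts = countWords(Seq("spark makes word count easy", "word count in spark"))
    // Same printing loop as Step 5, applied to a local Map.
    for (value <- counts) println(value)
  }
}
```

Each printed value is a (word, count) pair, which is also the shape of the elements in the wordcount RDD.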


Integrating Spark with Eclipse:

Step 1: First open Eclipse and set up the Scala plugin.

Go to Help -> Eclipse Marketplace

Step 2: Now search for the Scala plugin and install it


Click on Install

Click on Confirm

Then accept and install

Step 3: Now check whether the Scala plugin is installed in Eclipse

Go to New -> Other -> type scala


If Scala App appears in the list, the Scala plugin is installed.

Step 4: Now create a Maven project

Go to New -> Other -> type maven project -> Next -> Next -> Next


Step 5: Now enter the following:

Group Id: edu.sparkproject

Artifact Id: WordCount

Click Finish

Step 6: Now open the pom.xml file and add the Spark dependency


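Step 7: Now copy the code below and paste it into pom.xml. Note that this pom.xml sets the Group Id to com.scalaproject and the Artifact Id to scalaproject, which replace the values entered in Step 5.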

Link: http://pastebin.com/V5n0hM5P

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.scalaproject</groupId>
  <artifactId>scalaproject</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>
  <repositories>
    <repository>
      <id>pele.farmbio.uu.se</id>
      <url>http://pele.farmbio.uu.se/artifactory/libs-snapshot</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <id>compile</id>
            <goals>
              <goal>compile</goal>
            </goals>
            <phase>compile</phase>
          </execution>
          <execution>
            <id>test-compile</id>
            <goals>
              <goal>testCompile</goal>
            </goals>
            <phase>test-compile</phase>
          </execution>
          <execution>
            <phase>process-resources</phase>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>assemble-all</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <addClasspath>true</addClasspath>
              <mainClass>fully.qualified.MainClass</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.eclipse.m2e</groupId>
          <artifactId>lifecycle-mapping</artifactId>
          <version>1.0.0</version>
          <configuration>
            <lifecycleMappingMetadata>
              <pluginExecutions>
                <pluginExecution>
                  <pluginExecutionFilter>
                    <groupId>org.scala-tools</groupId>
                    <artifactId>maven-scala-plugin</artifactId>
                    <versionRange>[2.15.2,)</versionRange>
                    <goals>
                      <goal>compile</goal>
                      <goal>testCompile</goal>
                    </goals>
                  </pluginExecutionFilter>
                  <action>
                    <execute></execute>
                  </action>
                </pluginExecution>
              </pluginExecutions>
            </lifecycleMappingMetadata>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

Now save it. Maven will download all the dependencies.

Step 8: Now convert the project into a Scala project

First delete the src/test/java folder.

Now fix the error by clicking Quick Fix and then OK.

The error will disappear.


Step 9: Now add the Scala nature to the project

Step 10: Right-click on the project -> Properties


Step 11: Now go to Scala Compiler -> tick Use Project Settings -> select Fixed Scala Installation 2.10.6 -> Apply -> OK

(The spark-core_2.10 dependency in pom.xml is built against Scala 2.10, so the project's Scala version must match it.)
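
One way to confirm which Scala version a program actually runs on is to print it at startup. The sketch below is a hypothetical check that is not part of the original slides: the object name ScalaVersionCheck and the helper binaryVersion are made up for illustration. It compares the running version against the 2.10 suffix of the spark-core_2.10 artifact.

```scala
// Hypothetical check, not from the original slides: reports the Scala
// version the program is running on.
object ScalaVersionCheck {
  // Binary version = first two components, e.g. "2.10" from "2.10.6".
  def binaryVersion(full: String): String =
    full.split("\\.").take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    val full = scala.util.Properties.versionNumberString
    println("Scala version: " + full)
    println("Matches spark-core_2.10: " + (binaryVersion(full) == "2.10"))
  }
}
```

If the second line reports false, the compiler setting chosen in this step and the Spark artifact in pom.xml are probably out of step.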


Step 12: Then go to Java Build Path -> remove Scala Library Container

(The spark-core dependency already brings in the Scala library, so there is no need to add it separately here.)

Now rename the package to Scala.


Step 13: Now add a Scala Object file


Enter the Scala Object name -> Count


Step 14: Now copy the code from the link below and paste it into the Word.scala file

Link: http://pastebin.com/XNpbcJ2z

package com.scalaproject.scalaproject

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.nio.file.{Paths, Files}
import java.io._
import org.apache.commons.io.FileUtils
import org.apache.commons.io.filefilter.WildcardFileFilter
import scala.collection.immutable

object WordCount {
  def main(args: Array[String]) = {
    //Start the Spark context
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val test = sc.textFile("input.txt")
    test.flatMap(x => x.split("\\s+")).map(x => (x, 1)).reduceByKey((a, b) => a + b).saveAsTextFile("output")
    //Stop the Spark context
    sc.stop
  }

  def splitting(v: String): Array[String] = {
    v.split(" ")
  }
}
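
The shell session earlier splits each line on a single space, while the program above splits on the regular expression "\\s+". A small plain-Scala sketch of the difference follows. The object SplitSketch and its two helpers are hypothetical names, and the filter that drops empty tokens is an addition for illustration that the original program does not perform.

```scala
// Hypothetical comparison, not part of the original slides.
object SplitSketch {
  // Splits on each single space; consecutive spaces yield empty strings.
  def bySpace(line: String): Array[String] = line.split(" ")

  // Splits on runs of whitespace (spaces, tabs) and drops the empty token
  // that leading whitespace would otherwise produce.
  def byWhitespace(line: String): Array[String] =
    line.split("\\s+").filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val line = "spark  word\tcount" // two spaces, then a tab
    println(bySpace(line).toList)
    println(byWhitespace(line).toList)
  }
}
```

With the single-space split, the empty string and the tab-joined token would each be counted as a "word", which is why the whitespace pattern tends to give cleaner counts on real text.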

Step 15: Now add an input.txt file to the project as the input file to be processed.


Add some text to the input.txt file so that we can process it.


Step 16: Now run the code

Step 17: Refresh the project.

You will see the output folder in the project. Inside it there will be a part-00000 file that contains the output.
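
saveAsTextFile writes each (word, count) pair as one line of text, for example (spark,2). The following is a minimal sketch for reading those lines back into Scala values. It assumes the output folder is the local directory named output used in Step 14, and the names OutputReader, parseLine and readOutput are hypothetical.

```scala
import java.io.File
import scala.io.Source

// Hypothetical reader, not part of the original slides.
object OutputReader {
  // Parses one line such as "(spark,2)" into ("spark", 2).
  // Splits at the last comma so that words containing commas survive.
  def parseLine(line: String): (String, Int) = {
    val body = line.stripPrefix("(").stripSuffix(")")
    val idx = body.lastIndexOf(',')
    (body.substring(0, idx), body.substring(idx + 1).toInt)
  }

  // Reads every part-* file in the directory; empty if it does not exist.
  def readOutput(dir: String): Seq[(String, Int)] = {
    val files = Option(new File(dir).listFiles()).getOrElse(Array.empty[File])
    files.filter(_.getName.startsWith("part-")).sortBy(_.getName).toSeq.flatMap { f =>
      val src = Source.fromFile(f)
      try src.getLines().filter(_.nonEmpty).map(parseLine).toList
      finally src.close()
    }
  }

  def main(args: Array[String]): Unit =
    readOutput("output").foreach(println)
}
```

Spark refuses to write into an output directory that already exists, so the output folder generally has to be deleted or renamed before running the program again.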

