DIPENDRA KUSI https://www.linkedin.com/in/er-dipendra-kusi-b3674193

2/11/17 SPARK SETUP

Installation and Setup Spark

Step 1: First, set up Cloudera.

Step 2: Open a terminal in Cloudera and start the Spark shell:

/usr/bin/spark-shell

Step 3: Once the Spark shell has started, we can write Scala commands and run them in Spark through the Spark context (sc).

Now read the file from HDFS. Here the input file is already in HDFS:

val dt = sc.textFile("/user/cloudera/project_data/input")

We can put a file into HDFS using:

hadoop fs -put file0 /user/cloudera/project_data/input

Step 4: Now split the text content on whitespace and count the words:

val wordcount = dt.flatMap(x => x.split(" ")).map(x => (x, 1)).reduceByKey((a, b) => a + b)
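The same flatMap / map / reduce-by-key chain can be tried on plain Scala collections, without a Spark context. This is only a sketch of the logic: WordCountSketch and countWords are hypothetical names, and groupBy followed by reduce stands in for Spark's reduceByKey.

```scala
// Plain-Scala sketch of the word count logic; no Spark is needed to run it.
// WordCountSketch and countWords are illustrative names, not part of Spark.
object WordCountSketch {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" ").toSeq)   // one element per word
      .map(word => (word, 1))                   // pair each word with a count of 1
      .groupBy(_._1)                            // local stand-in for grouping by key
      .map { case (word, pairs) =>
        (word, pairs.map(_._2).reduce((a, b) => a + b))  // sum the 1s per word
      }

  def main(args: Array[String]): Unit = {
    val counts = countWords(Seq("to be or", "not to be"))
    counts.foreach(println)   // prints one (word,count) pair per line
  }
}
```

On an RDD the grouping and summing happen in a single reduceByKey step, which also combines partial sums on each partition before shuffling.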

Step 5: Now print the result:

for(value <- wordcount) {println(value)}

Integrate Spark in Eclipse:

Step 1: First open Eclipse and set up the Scala plugin.

Go to Help -> Eclipse Marketplace

Step 2: Now search for the Scala plugin and install it

Click Install

Click Confirm

Then click Accept and install

Step 3: Now check whether the Scala plugin is installed in Eclipse

Go to New -> Other -> type scala

If Scala App appears in the list, the Scala plugin is installed

Step 4: Now create a Maven project

Go to New -> Other -> type maven project -> Next -> Next -> Next

Step 5: Now enter the project coordinates:

Group Id: edu.sparkproject

Artifact Id: WordCount

Click Finish

Step 6: Now open the pom.xml file to add the Spark dependency

Step 7: Now copy the code below and paste it into pom.xml

Link: http://pastebin.com/V5n0hM5P

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.scalaproject</groupId>
  <artifactId>scalaproject</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>
  <repositories>
    <repository>
      <id>pele.farmbio.uu.se</id>
      <url>http://pele.farmbio.uu.se/artifactory/libs-snapshot</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <!-- mixed scala/java compile -->
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <id>compile</id>
            <goals>
              <goal>compile</goal>
            </goals>
            <phase>compile</phase>
          </execution>
          <execution>
            <id>test-compile</id>
            <goals>
              <goal>testCompile</goal>
            </goals>
            <phase>test-compile</phase>
          </execution>
          <execution>
            <phase>process-resources</phase>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
      <!-- for fatjar -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>assemble-all</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <addClasspath>true</addClasspath>
              <mainClass>fully.qualified.MainClass</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
    <pluginManagement>
      <plugins>
        <!-- This plugin's configuration is used to store Eclipse m2e settings only.
             It has no influence on the Maven build itself. -->
        <plugin>
          <groupId>org.eclipse.m2e</groupId>
          <artifactId>lifecycle-mapping</artifactId>
          <version>1.0.0</version>
          <configuration>
            <lifecycleMappingMetadata>
              <pluginExecutions>
                <pluginExecution>
                  <pluginExecutionFilter>
                    <groupId>org.scala-tools</groupId>
                    <artifactId>maven-scala-plugin</artifactId>
                    <versionRange>[2.15.2,)</versionRange>
                    <goals>
                      <goal>compile</goal>
                      <goal>testCompile</goal>
                    </goals>
                  </pluginExecutionFilter>
                  <action>
                    <execute></execute>
                  </action>
                </pluginExecution>
              </pluginExecutions>
            </lifecycleMappingMetadata>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

Now save it. Maven will download all the dependencies.

Step 8: Now convert the project into a Scala project

First delete the src/test/java folder

Now fix the error by clicking Quick Fix and then OK.

The error will disappear.

Step 9: Now add the Scala Nature to the project

Step 10: Right-click on the project -> Properties

Step 11: Now go to Scala Compiler -> tick Use Project Settings -> select Fixed Scala Installation 2.10.6 -> Apply -> OK

(The spark-core_2.10 artifact used here is built for Scala 2.10, so the project's Scala version must match the one Spark runs on.)
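To confirm which Scala version a project actually runs with, the standard library exposes it through scala.util.Properties. The sketch below is an illustration only: ScalaVersionCheck and isCompatible are hypothetical names, and the "2.10." prefix check simply mirrors the spark-core_2.10 artifact in the pom.xml.

```scala
// Sketch: report the Scala version of the running program and compare it
// with the 2.10 line that the spark-core_2.10 artifact is built for.
// ScalaVersionCheck and isCompatible are illustrative names.
object ScalaVersionCheck {
  // True when the version string belongs to the Scala 2.10.x series.
  def isCompatible(version: String): Boolean = version.startsWith("2.10.")

  def main(args: Array[String]): Unit = {
    // versionNumberString is provided by the Scala standard library.
    val version = scala.util.Properties.versionNumberString
    println("Scala version: " + version)
    if (!isCompatible(version))
      println("Warning: spark-core_2.10 expects a Scala 2.10.x runtime")
  }
}
```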

Step 12: Then go to Java Build Path -> remove Scala Library Container

(The spark-core dependency already brings in the Scala library, so there is no need to have it here as well.)

Now rename the package to Scala

Step 13: Now add the Scala Object File

Give the Scala Object Name -> Count

Step 14: Now copy the code from the link below and paste it into the Word.scala file

Link: http://pastebin.com/XNpbcJ2z

package com.scalaproject.scalaproject

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.nio.file.{Paths, Files}
import java.io._
import org.apache.commons.io.FileUtils
import org.apache.commons.io.filefilter.WildcardFileFilter
import scala.collection.immutable

object WordCount {
  def main(args: Array[String]) = {
    // Start the Spark context
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local")
    val sc = new SparkContext(conf)

    val test = sc.textFile("input.txt")
    test.flatMap(x => x.split("\\s+"))
      .map(x => (x, 1))
      .reduceByKey((a, b) => a + b)
      .saveAsTextFile("output")

    // Stop the Spark context
    sc.stop()
  }

  def splitting(v: String): Array[String] = {
    v.split(" ")
  }
}
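The program splits on "\\s+" inside main, while the splitting helper and the earlier shell example split on a single space. The two behave differently when the input has repeated spaces or tabs. A minimal sketch of the difference, using a hypothetical SplitComparison object and a made-up sample line:

```scala
// Sketch comparing split(" ") with split("\\s+") on the same line.
// SplitComparison and the sample text are illustrative only.
object SplitComparison {
  def bySpace(line: String): Array[String] = line.split(" ")
  def byWhitespace(line: String): Array[String] = line.split("\\s+")

  def main(args: Array[String]): Unit = {
    val line = "to  be\tor not"   // two spaces after "to", a tab after "be"

    // Splitting on a single space leaves an empty string between the two
    // spaces and keeps "be\tor" together, giving 4 elements.
    println(bySpace(line).length)

    // Splitting on runs of whitespace yields the 4 words: to, be, or, not.
    println(byWhitespace(line).mkString(","))
  }
}
```

With split(" "), the empty string would be counted as a word of its own, so "\\s+" is usually the safer choice for free-form text.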

Step 15: Now add an input.txt file to the project as the input to be processed.

Add some text to the input.txt file so that there is something to process.

Step 16: Now run the code

Step 17: Refresh the project.

You will see the output folder in the project. Go inside it and there will be a part-00000 file that contains the output.
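Each line of the output file is the text form of a (word, count) tuple, for example (spark,3). If the counts are needed again in Scala, the lines can be parsed back. The sketch below is one way to do that under this assumption about the line format; OutputLineParser and parse are hypothetical names.

```scala
import scala.util.Try

// Sketch: turn a saved line such as "(spark,3)" back into a (word, count) pair.
// OutputLineParser and parse are illustrative names.
object OutputLineParser {
  // Returns None for lines that do not look like "(word,count)".
  def parse(line: String): Option[(String, Int)] = {
    val trimmed = line.trim
    if (trimmed.startsWith("(") && trimmed.endsWith(")")) {
      val inner = trimmed.substring(1, trimmed.length - 1)
      // Use the last comma, because the word itself may contain commas.
      val idx = inner.lastIndexOf(',')
      if (idx < 0) None
      else Try(inner.substring(idx + 1).toInt).toOption
        .map(count => (inner.substring(0, idx), count))
    } else None
  }

  def main(args: Array[String]): Unit = {
    Seq("(spark,3)", "(hello,,1)", "not a tuple").map(parse).foreach(println)
  }
}
```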
