DIPENDRA KUSI https://www.linkedin.com/in/er-dipendra-kusi-b3674193
2/11/17 SPARK SETUP
Spark Installation and Setup
Step 1: First, set up Cloudera.
Step 2: Open a terminal in Cloudera and start the Spark shell:
/usr/bin/spark-shell
Step 3: Once the shell has started, we can run Scala commands against Spark through the Spark context (sc).
Now read the input file from HDFS. Here the input file is stored at /user/cloudera/project_data/input:
val dt = sc.textFile("/user/cloudera/project_data/input")
A file can be placed in HDFS using:
hadoop fs -put file0 /user/cloudera/project_data/input
Step 4: Now split the text content on whitespace and count the occurrences of each word:
val wordcount = dt.flatMap(x => x.split(" ")).map(x => (x, 1)).reduceByKey((a, b) => a + b)
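To make the Step 4 pipeline easier to follow, here is a minimal sketch of the same flatMap / map / sum-per-key logic on an ordinary Scala collection, with no Spark involved. The object name LocalWordCount and the sample lines are hypothetical, and groupBy followed by a sum stands in for Spark's reduceByKey.

```scala
object LocalWordCount {
  // Mirrors dt.flatMap(...).map(x => (x, 1)).reduceByKey(...) on a local Seq.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))  // one element per word
      .map(word => (word, 1))            // pair each word with the count 1
      .groupBy(pair => pair._1)          // local stand-in for grouping by key
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // add up the 1s

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input, used only for illustration.
    val sample = Seq("spark is fast", "spark is fun")
    count(sample).foreach(println)
  }
}
```

Each word ends up paired with the number of times it appears, which is the shape of the (word, count) tuples that the Spark version produces.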
Step 5: Now print the result:
for(value <- wordcount) {println(value)}
Integrating Spark in Eclipse:
Step 1: First open Eclipse and set up the Scala plugin.
Go to Help -> Eclipse Marketplace
Step 2: Now search for the Scala plugin and install it
Click Install
Click Confirm
Then accept the license and finish the installation
Step 3: Now check whether the Scala plugin is installed in Eclipse
Go to New -> Other -> type scala
If Scala App appears in the list, the Scala plugin is installed
Step 4: Now create a Maven project
Go to New -> Other -> type maven project -> Next -> Next -> Next
Step 5: Now enter the following:
Group Id: edu.sparkproject
Artifact Id: WordCount
Click Finish
Step 6: Now open the pom.xml file and edit it to add the Spark dependency
Step 7: Now copy the code below and paste it into pom.xml
Link: http://pastebin.com/V5n0hM5P
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.scalaproject</groupId>
  <artifactId>scalaproject</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>
  <repositories>
    <repository>
      <id>pele.farmbio.uu.se</id>
      <url>http://pele.farmbio.uu.se/artifactory/libs-snapshot</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <!-- mixed scala/java compile -->
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <id>compile</id>
            <goals>
              <goal>compile</goal>
            </goals>
            <phase>compile</phase>
          </execution>
          <execution>
            <id>test-compile</id>
            <goals>
              <goal>testCompile</goal>
            </goals>
            <phase>test-compile</phase>
          </execution>
          <execution>
            <phase>process-resources</phase>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
      <!-- for fatjar -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>assemble-all</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <addClasspath>true</addClasspath>
              <mainClass>fully.qualified.MainClass</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
    <pluginManagement>
      <plugins>
        <!--This plugin's configuration is used to store Eclipse m2e settings only. It has no influence on the Maven build itself. -->
        <plugin>
          <groupId>org.eclipse.m2e</groupId>
          <artifactId>lifecycle-mapping</artifactId>
          <version>1.0.0</version>
          <configuration>
            <lifecycleMappingMetadata>
              <pluginExecutions>
                <pluginExecution>
                  <pluginExecutionFilter>
                    <groupId>org.scala-tools</groupId>
                    <artifactId>maven-scala-plugin</artifactId>
                    <versionRange>[2.15.2,)</versionRange>
                    <goals>
                      <goal>compile</goal>
                      <goal>testCompile</goal>
                    </goals>
                  </pluginExecutionFilter>
                  <action>
                    <execute></execute>
                  </action>
                </pluginExecution>
              </pluginExecutions>
            </lifecycleMappingMetadata>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>
Now save the file. Maven will download all the dependencies.
Step 8: Now convert the project into a Scala project
First delete the src/test/java folder
Then fix the reported error by clicking Quick Fix and OK.
The error will disappear.
Step 9: Now add the Scala nature to the project
Step 10: Right-click on the project -> Properties
Step 11: Now go to Scala Compiler -> tick Use Project Settings -> select Fixed Scala Installation 2.10.6 -> Apply -> OK
(The spark-core_2.10 artifact declared in pom.xml is built against Scala 2.10, so the project's Scala version must match it.)
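As a quick sanity check on the version match described in Step 11, the following sketch compares the running Scala version with the suffix of the Spark artifact. The object name ScalaVersionCheck and its helper methods are hypothetical; only scala.util.Properties.versionNumberString comes from the standard library.

```scala
object ScalaVersionCheck {
  // "2.10.6" -> "2.10": the binary version is the first two components.
  def binaryVersion(fullVersion: String): String =
    fullVersion.split("\\.").take(2).mkString(".")

  // True when the artifact's _<version> suffix equals the binary version.
  def matches(artifactId: String, fullVersion: String): Boolean =
    artifactId.endsWith("_" + binaryVersion(fullVersion))

  def main(args: Array[String]): Unit = {
    // Version of the Scala library that is actually on the classpath.
    val running = scala.util.Properties.versionNumberString
    println("Running Scala " + running)
    println("Matches spark-core_2.10: " + matches("spark-core_2.10", running))
  }
}
```

Running this inside the Eclipse project should report a match once the fixed 2.10.6 installation is selected; a mismatch suggests the compiler setting and the pom.xml artifact disagree.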
Step 12: Then go to Java Build Path -> remove Scala Library Container
(The spark-core dependency already brings in the Scala library, so a separate library container is not needed here.)
Now rename the package to scala
Step 13: Now add a Scala Object file
Give the Scala Object Name -> Count
Step 14: Now copy the code from the link below and paste it into the Scala file created in Step 13
Link: http://pastebin.com/XNpbcJ2z
package com.scalaproject.scalaproject

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.nio.file.{Paths, Files}
import java.io._
import org.apache.commons.io.FileUtils
import org.apache.commons.io.filefilter.WildcardFileFilter
import scala.collection.immutable
object WordCount {
  def main(args: Array[String]): Unit = {
    // Start the Spark context
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local")
    val sc = new SparkContext(conf)

    // Read the input, split each line on whitespace, and count each word
    val test = sc.textFile("input.txt")
    test.flatMap(x => x.split("\\s+"))
      .map(x => (x, 1))
      .reduceByKey((a, b) => a + b)
      .saveAsTextFile("output")

    // Stop the Spark context
    sc.stop()
  }

  // Helper that splits a string on single spaces (not used by main)
  def splitting(v: String): Array[String] = {
    v.split(" ")
  }
}
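The shell example splits on a single space (" ") while the WordCount object splits on the regular expression "\\s+". The sketch below illustrates how the two differ on input with repeated or leading whitespace. The object name SplitComparison and its method names are hypothetical, and the extra filter for empty strings is a suggestion rather than part of the original code.

```scala
object SplitComparison {
  // Splitting on one space yields an empty token between consecutive spaces.
  def bySingleSpace(line: String): Seq[String] = line.split(" ").toSeq

  // Splitting on \s+ treats any run of spaces or tabs as one separator.
  def byWhitespaceRuns(line: String): Seq[String] = line.split("\\s+").toSeq

  // Leading whitespace still produces one empty first token, so drop empties.
  def tokens(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    val line = " spark  is\tfast" // hypothetical sample line
    println(bySingleSpace(line).mkString("[", "|", "]"))
    println(byWhitespaceRuns(line).mkString("[", "|", "]"))
    println(tokens(line).mkString("[", "|", "]"))
  }
}
```

Without such a filter, an empty string can show up as a "word" with its own count in the output.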
Step 15: Now add an input.txt file to the project as the input to be processed.
Add some text to the input.txt file so that there is something to process.
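If it is more convenient to create the input file from code than by hand, a minimal sketch could look like the following. The object name MakeInput and the sample text are hypothetical; the path input.txt is relative to the working directory, which is assumed here to be the project root when the program is run from Eclipse.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object MakeInput {
  // Writes the given text to the file at path, replacing any existing content.
  def write(path: String, text: String): Unit = {
    Files.write(Paths.get(path), text.getBytes(StandardCharsets.UTF_8))
    ()
  }

  // Reads the whole file back as a UTF-8 string.
  def read(path: String): String =
    new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    // Hypothetical sample text, used only for illustration.
    write("input.txt", "spark is fast\nspark is fun\n")
    print(read("input.txt"))
  }
}
```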
Step 16: Now run the code.
Step 17: Refresh the project.
You will see an output folder in the project. Inside it there is a part-00000 file that contains the output.
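To inspect the result programmatically, the part files can be read back with plain Scala. This is a minimal sketch that assumes the output folder was produced by saveAsTextFile on (word, count) tuples, so each line has the form (word,count). The object name ReadWordCounts and its methods are hypothetical.

```scala
import java.io.File
import scala.io.Source

object ReadWordCounts {
  // Parses one line such as "(spark,2)" into ("spark", 2); None if malformed.
  def parseLine(line: String): Option[(String, Int)] = {
    val trimmed = line.trim
    // Use the last comma, since the word itself may contain commas.
    val comma = trimmed.lastIndexOf(',')
    if (trimmed.startsWith("(") && trimmed.endsWith(")") && comma > 0) {
      val word = trimmed.substring(1, comma)
      val count = trimmed.substring(comma + 1, trimmed.length - 1)
      try Some((word, count.toInt))
      catch { case _: NumberFormatException => None }
    } else None
  }

  // Reads every part-* file in the folder; a missing folder yields no rows.
  def readAll(outputDir: String): Seq[(String, Int)] = {
    val files = Option(new File(outputDir).listFiles).getOrElse(Array.empty[File])
    val parts = files.filter(f => f.getName.startsWith("part-")).sortBy(_.getName)
    parts.toSeq.flatMap { f =>
      val src = Source.fromFile(f, "UTF-8")
      try src.getLines().toList.flatMap(line => parseLine(line))
      finally src.close()
    }
  }

  def main(args: Array[String]): Unit =
    readAll("output").sortBy(pair => -pair._2).foreach(println)
}
```

The main method prints the words in descending order of count, which is often easier to read than the unordered part file.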