
Source: cis.csuohio.edu/~sschung/cis612/CIS612_SparkBasicProcessingTutorialHeideloff.pdf


Spark Tutorial with Set Up and Basic File Processing

CIS 612

1) Download Apache Spark release 2.4.1 from http://spark.apache.org/downloads.html

Choose the binary without Hadoop, as Hadoop is already installed and configured on my system

2) Follow Apache documentation for setting up Spark with your own Hadoop installation:

http://spark.apache.org/docs/latest/hadoop-provided.html

Modify SPARK_DIST_CLASSPATH to include Hadoop’s package jars by adding an entry in conf/spark-env.sh


3) Put JSON files on HDFS to use in Spark

4) Run Spark Shell


5) Get a SQLContext to be able to use SparkSQL
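A minimal sketch of step 5, assuming it is pasted into the Spark 2.4 shell started in step 4. The application name is a placeholder, and the original tutorial may have obtained its SQLContext differently.

```scala
import org.apache.spark.sql.{SQLContext, SparkSession}

// In spark-shell, getOrCreate() returns the session the shell already started.
// The app name is a placeholder; it only applies if a new session is created.
val spark = SparkSession.builder().appName("cis612-spark-tutorial").getOrCreate()

// Spark 2.x keeps SQLContext as a compatibility wrapper around the session.
val sqlContext: SQLContext = spark.sqlContext
```

In Spark 2.x the same queries can also be issued through `spark.sql(...)` directly; the SQLContext is used here only to match the wording of the step.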

6) Import business100.json file

7) Import review100B.json file
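Steps 6 and 7 might look like the following sketch in the Spark shell. The paths are assumptions: the tutorial does not state where the files were placed in step 3, so the relative paths below stand for files in the current user's HDFS home directory.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().getOrCreate()
val sqlContext = spark.sqlContext

// Assumed locations: adjust to wherever step 3 put the files on HDFS.
val businessPath = "business100.json"
val reviewPath = "review100B.json"

// read.json expects one JSON object per line and infers the schema.
val business: DataFrame = sqlContext.read.json(businessPath)
val review: DataFrame = sqlContext.read.json(reviewPath)

// Quick sanity check that both files were read.
println(s"businesses: ${business.count()}, reviews: ${review.count()}")
```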


8) Show schema of business100


9) Show schema of review100B
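Steps 8 and 9 can be sketched as below. The file paths are the same assumed locations as before (files in the user's HDFS home directory), not paths taken from the tutorial.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sqlContext = spark.sqlContext

// Assumed paths; see step 3 for the actual HDFS location.
val business = sqlContext.read.json("business100.json")
val review = sqlContext.read.json("review100B.json")

// printSchema() prints the inferred column names and types as a tree.
business.printSchema()
review.printSchema()
```

The printed trees show which field names and nested structures are available for the queries in the later steps.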

10) Register the SchemaRDDs (called DataFrames since Spark 1.3) as tables to be able to query them with Spark SQL
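A sketch of step 10 for Spark 2.4, where temporary views play the role of registered tables. The view names `businesses` and `reviews` and the file paths are assumptions chosen to match the wording of the later steps.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sqlContext = spark.sqlContext

// Assumed paths; see step 3 for the actual HDFS location.
val business = sqlContext.read.json("business100.json")
val review = sqlContext.read.json("review100B.json")

// createOrReplaceTempView replaces the older, deprecated registerTempTable.
business.createOrReplaceTempView("businesses")
review.createOrReplaceTempView("reviews")

// List the registered names to confirm both views exist.
sqlContext.tableNames().foreach(println)
```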

11) Query the businesses table to find businesses rated higher than 4 stars
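Step 11 might be written as the sketch below. The column names `business_id`, `name`, and `stars` are assumptions about the JSON schema; check them against the schema printed in step 8. Paths and view names are likewise assumed.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sqlContext = spark.sqlContext

// Assumed path and view name.
sqlContext.read.json("business100.json").createOrReplaceTempView("businesses")

// Assumed columns: business_id, name, stars.
val topBusinesses = sqlContext.sql(
  "SELECT business_id, name, stars FROM businesses WHERE stars > 4")

topBusinesses.show()
```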

12) Create a table out of businessids rated higher than 4 stars, then join to review table by businessid to get the reviews of these highly rated businesses
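One way to sketch step 12 is shown below. The view name `good_businesses`, the join column `business_id`, and the file paths are assumptions; the tutorial only says the join is made "by businessid".

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sqlContext = spark.sqlContext

// Assumed paths and view names.
sqlContext.read.json("business100.json").createOrReplaceTempView("businesses")
sqlContext.read.json("review100B.json").createOrReplaceTempView("reviews")

// Ids of businesses rated higher than 4 stars, registered as their own view.
sqlContext.sql("SELECT business_id FROM businesses WHERE stars > 4")
  .createOrReplaceTempView("good_businesses")

// Inner join keeps only reviews whose business appears in good_businesses.
val goodReviews = sqlContext.sql(
  """SELECT r.*
    |FROM reviews r
    |JOIN good_businesses g ON r.business_id = g.business_id""".stripMargin)

goodReviews.show(5)
```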


13) Find funny reviews from this list of good businesses (more than 5 votes for “funny”)
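Step 13 could be sketched as follows. It assumes the review records carry a nested `votes` struct with a numeric `funny` field, plus a `text` column; these names are guesses and should be checked against the schema printed in step 9. Paths and view names are also assumed.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sqlContext = spark.sqlContext

// Assumed paths and view names.
sqlContext.read.json("business100.json").createOrReplaceTempView("businesses")
sqlContext.read.json("review100B.json").createOrReplaceTempView("reviews")

// Assumed fields: reviews.votes.funny, reviews.text, businesses.stars.
val funnyGoodReviews = sqlContext.sql(
  """SELECT r.business_id, r.votes.funny AS funny_votes, r.text
    |FROM reviews r
    |JOIN businesses b ON r.business_id = b.business_id
    |WHERE b.stars > 4 AND r.votes.funny > 5""".stripMargin)

// truncate = false prints the full review text instead of the first 20 characters.
funnyGoodReviews.show(10, truncate = false)
```

This sketch filters in a single query rather than reusing a table from step 12; reusing that table with an extra `WHERE` clause would be an equivalent approach.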