
Page 1

CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2017

Doc 8 Spark Intro Modified Sep 26, 2017

Copyright ©, All rights reserved. 2017 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA. OpenContent (http://www.opencontent.org/opl.shtml) license defines the copyright on this document.

Page 2

Spark

2

Created at UC Berkeley’s AMPLab

2009 - Project started
2014 May - Version 1.0
2016 July - Version 2.0.2
2017 July - Version 2.2.0

Programming interface for Java, Python, Scala, R

Interactive shell for Python, Scala, R (experimental)

Runs on Linux, Mac, Windows

Cluster managers: native Spark cluster, Hadoop YARN, Apache Mesos

File systems: HDFS, MapR File System, Cassandra, OpenStack Swift, S3

Pseudo-distributed mode: single machine, uses the local file system

Page 3

Python vs Scala on Spark

3

Scala is faster than Python, but that is not so important here: most of the computation is done inside Spark itself.

Using Python with Spark, Python data has to be

converted between Python format and Scala/Java format
sent between the Python process and the JVM

Page 4

Time Line

4

1991 - Java project started

1995 - Java 1.0 released, Design Patterns book published

2000 - Java 1.3 (J2SE 1.3)

2001 - Scala project started

2002 - Nutch started

2004 - Google MapReduce paper

Scala version 1 released

2005 - F# released

2006 - Hadoop split from Nutch

Scala version 2 released

2007 - Clojure released

2009 - Spark project started

2012 - Hadoop 1.0

2014 - Spark 1.0

Page 5

Major Parts of Spark

5

Spark Core: Resilient Distributed Dataset (RDD)

Spark SQL: SQL, CSV, JSON, DataFrames

Spark Streaming: near real-time response

MLlib machine learning library: statistics, regression, clustering, dimension reduction, feature extraction, optimization

GraphX

Page 6

Spark

6

Ecosystem of packages, libraries and systems on top of Spark Core

Unstructured API: Resilient Distributed Datasets (RDDs), accumulators, broadcast variables

Structured API: DataFrames, Datasets, Spark SQL
Newer, faster, higher level; preferred over the Unstructured API

Page 7

Basic Architecture

7

Page 8

Local Mode

8

[Diagram: a driver process containing the SparkSession, the user code, and the worker threads]

We will start by using local mode

Use local mode to develop Spark code

Page 9

SparkContext

9

Connection to a Spark cluster; runs on the master node
Used to create RDDs, accumulators, and broadcast variables
Only one SparkContext per JVM; stop() the current SparkContext before starting another

SparkContext (org.apache.spark.SparkContext) - Scala version

JavaSparkContext (org.apache.spark.api.java.JavaSparkContext) - Java version

Entry point for the Unstructured API

Page 10

SparkSession

10

Contains a SparkContext

Entry point to use Dataset & DataFrame

Connection to a Spark cluster; runs on the master node

org.apache.spark.sql.SparkSession
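
A minimal sketch of creating a SparkSession in a standalone program (the application name below is a placeholder); the spark-shell already provides one as the variable spark:

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession; local[*] runs Spark locally on all cores
val spark = SparkSession.builder()
  .appName("SparkSession Example")   // placeholder application name
  .master("local[*]")
  .getOrCreate()

// The SparkContext it contains is the entry point for the Unstructured API
val sc = spark.sparkContext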

Page 11

Major Data Structures

11

Resilient Distributed Datasets (RDDs): fault-tolerant collection of elements that can be operated on in parallel

Datasets & DataFrames: fault-tolerant collection of elements that can be operated on in parallel; rows & columns; JSON, CSV, SQL tables; part of Spark SQL

Page 12

Partitions

12

RDD & Dataset Divided into partitions Each partition is on different machine

Page 13

Resilient & Distributed

13

Distributed: partitions are on different machines

Resilient: each partition can be replicated on multiple machines; the data structure knows how to reproduce its operations

Page 14

Basic Operations

14

RDDs, DataFrames, and Datasets are immutable

Transformations: create a new dataset (RDD) from an existing one; lazy - only computed when needed by an action. Examples:

map, filter, sample, union, distinct, groupByKey, repartition

Actions: return results to the driver program. Examples:

reduce, collect, count, first, take
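
A short spark-shell sketch of this laziness: the where transformation only records a step in the plan, and nothing runs until the count action is called:

val numbers = spark.range(1000).toDF("n")   // nothing computed yet
val evens = numbers.where("n % 2 = 0")      // transformation, lazily added to the plan
val howMany = evens.count()                 // action: triggers the computation, returns 500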

Page 15

Actions & Transformations on DataSet

15

org.apache.spark.sql.Dataset

View the Spark Scala API

Page 16

Starting Spark Scala REPL

16

./bin/spark-shell

From Spark installation

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@508abc74

scala> spark
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1c618295

Provided variables - Spark Scala REPL only

Page 17

Starting spark shell

17

Al pro 9->spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/09/24 19:57:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/24 19:58:03 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.1.9:4040
Spark context available as 'sc' (master = local[*], app id = local-1506308274361).
Spark session available as 'spark'.
Welcome to Spark version 2.2.0
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

Page 18

Sample Interaction

18

where - transformation
count - action

scala> val range = spark.range(100)
range: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val rangeWithLabel = range.toDF("number")
rangeWithLabel: org.apache.spark.sql.DataFrame = [number: bigint]

scala> val divisibleBy2 = rangeWithLabel.where("number % 2 = 0")
divisibleBy2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [number: bigint]

scala> divisibleBy2.count()
res2: Long = 50

Page 19

Sample Interaction

19

scala> val filePath = "/Users/whitney/test/README.md"
filePath: String = /Users/whitney/test/README.md

scala> val textFile = spark.read.textFile(filePath)
textFile: org.apache.spark.sql.Dataset[String] = [value: string]

scala> textFile.count()
res3: Long = 103

scala> textFile.first()
res4: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]

scala> linesWithSpark.count()
res5: Long = 20

Page 20

spark.read returns DataFrameReader

20

Can read: csv, jdbc, json, ORC, Parquet, text

Page 21

Setting number of Workers

21

./bin/spark-shell --master local[4]

Page 22

Application using SBT

22

name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.11"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"

build.sbt
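
A hedged sketch of building and running such a project with sbt and spark-submit; the jar path assumes sbt's default naming for the name and version above, and the main class name is a placeholder:

sbt package
./bin/spark-submit \
  --class "SparkTest" \
  --master "local[4]" \
  target/scala-2.11/simple-project_2.11-1.0.jar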

Page 23

Sample Program Using RDDs

23

import org.apache.spark.{SparkConf, SparkContext}

object MasterConnect {
  def main(args: Array[String]): Unit = {
    val filePath = "/Users/whitney/test/README.md"
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val minPartitions = 2
    val logData = sc.textFile(filePath, minPartitions)
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

Page 24

Sample Program Using Spark SQL

24

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

object SparkTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("Word Count")
    conf.setMaster("local[2]")
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    val filePath = "/Users/whitney/test/README.md"
    val textFile = spark.read.textFile(filePath)
    val linesWithSpark = textFile.filter(line => line.contains("Spark"))
    val sparkCount = linesWithSpark.count()
    println(sparkCount)
  }
}

Page 25

Output

25

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/09/24 19:53:17 INFO SparkContext: Running Spark version 2.2.0
17/09/24 19:53:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/24 19:53:19 INFO SparkContext: Submitted application: Datasets Test
17/09/24 19:53:19 INFO SecurityManager: Changing view acls to: whitney
17/09/24 19:53:19 INFO SecurityManager: Changing modify acls to: whitney
17/09/24 19:53:19 INFO SecurityManager: Changing view acls groups to:
17/09/24 19:53:19 INFO SecurityManager: Changing modify acls groups to:
17/09/24 19:53:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(whitney); groups with view permissions: Set(); users with modify permissions: Set(whitney); groups with modify permissions: Set()
17/09/24 19:53:20 INFO Utils: Successfully started service 'sparkDriver' on port 61753.
17/09/24 19:53:20 INFO SparkEnv: Registering MapOutputTracker
17/09/24 19:53:20 INFO SparkEnv: Registering BlockManagerMaster
17/09/24 19:53:20 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/09/24 19:53:20 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/09/24 19:53:20 INFO DiskBlockManager: Created local directory at /private/var/folders/br/q_fcsjqc8xj9qn0059bctj3h0000gr/T/blockmgr-45dac785-124b-4203-a39e-895866d2cca3
17/09/24 19:53:20 INFO MemoryStore: MemoryStore started with capacity 912.3 MB
17/09/24 19:53:20 INFO SparkEnv: Registering OutputCommitCoordinator
17/09/24 19:53:20 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/09/24 19:53:21 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.9:4040
17/09/24 19:53:21 INFO Executor: Starting executor ID driver on host localhost
17/09/24 19:53:21 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 61754.

Page 26

DataFrame, DataSet & RDD

26

What are they

What is the difference

When to use which one

Which languages can use them

Page 27

DataFrame

27

+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

Table with rows and columns

Schema: column labels and column types

Row: org.apache.spark.sql.Row

Partitioner: distributes the DataFrame across the cluster

Plan: series of transformations to perform on the DataFrame

Languages: Scala, Java, JVM languages, Python, R

Optimized by the Spark Catalyst Optimizer

Page 28

DataSet

28

Same as DataFrame except for Rows

Programmer defines the row class: Scala case class or Java Bean

Difference from DataFrame: the compiler knows the column names and column types in a Dataset

Compile-time error checking

Better data layout

Languages: Scala, Java, JVM languages
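
A minimal spark-shell sketch of a Dataset with a programmer-defined row class; the Person case class and sample rows are made up for illustration:

case class Person(name: String, age: Long)       // programmer-defined row type
import spark.implicits._                          // encoders for case classes

val peopleDS = Seq(Person("Andy", 30), Person("Justin", 19)).toDS()
peopleDS.filter(p => p.age > 21).show()           // field names and types checked at compile time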

Page 29

RDD

29

+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

Table with no information about types

No compile-time or runtime type checking

Far fewer optimizations: no Catalyst Optimizer, no space optimization

Example - the same data: 33.3 MB as an RDD, 7.3 MB as a DataFrame

Languages: Java, Scala (Python, R - not recommended)

Shares same basic operations as DataFrames & DataSets

Page 30

Spark Types

30

Java types are not space efficient: the String "abcd" takes 48 bytes

Spark has its own types

Special memory representation of each type Space efficient Cache aware

Spark type     Scala type   Python type    Python API
ByteType       Byte         int or long    ByteType()
ShortType      Short        int or long    ShortType()
IntegerType    Int          int or long    IntegerType()
LongType       Long         int or long    LongType()
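
A small sketch of using these Spark types to declare a schema by hand (the column names are made up here); StructType and StructField appear again on the Schema slide below:

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// an explicit schema built from Spark's own types rather than inferred
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("count", LongType, nullable = true)))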

Page 31

Structured versus Unstructured

31

Structured = DataSet, DataFrame

Unstructured = RDD

Page 32

Typed versus Untyped

32

Typed = DataSet

Untyped = DataFrame

Page 33

Some Sample Data

33

{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":15} {"ORIGIN_COUNTRY_NAME":"Croatia","DEST_COUNTRY_NAME":"United States","count":1} {"ORIGIN_COUNTRY_NAME":"Ireland","DEST_COUNTRY_NAME":"United States","count":344} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Egypt","count":15} {"ORIGIN_COUNTRY_NAME":"India","DEST_COUNTRY_NAME":"United States","count":62} {"ORIGIN_COUNTRY_NAME":"Singapore","DEST_COUNTRY_NAME":"United States","count":1} {"ORIGIN_COUNTRY_NAME":"Grenada","DEST_COUNTRY_NAME":"United States","count":62} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Costa Rica","count":588} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Senegal","count":40} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Moldova","count":1}

JSON flight Data 2015

United States Bureau of Transportation statistics

Spark: The Definitive Guide, Zaharia & Chambers, O'Reilly Media, Inc., 2017-10-??

2015-summary.json

Page 34

34

scala> val jsonFlightFile = "/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summary.json"
jsonFlightFile: String = /Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summary.json

scala> val flightData2015 = spark.read.json(jsonFlightFile)
flightData2015: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> flightData2015.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([United States,Romania,15], [United States,Croatia,1])

Page 35

Explain - Spark Plan

35

scala> flightData2015.explain()
== Physical Plan ==
*FileScan json [DEST_COUNTRY_NAME#44,ORIGIN_COUNTRY_NAME#45,count#46L]
   Batched: false, Format: JSON,
   Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summ...,
   PartitionFilters: [], PushedFilters: [],
   ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:bigint>

Page 36

36

scala> val sortedFlightData2015 = flightData2015.sort("count")
sortedFlightData2015: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> sortedFlightData2015.explain()
== Physical Plan ==
*Sort [count#46L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#46L ASC NULLS FIRST, 200)
   +- *FileScan json [DEST_COUNTRY_NAME#44,ORIGIN_COUNTRY_NAME#45,count#46L]
      Batched: false, Format: JSON,
      Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summ...,
      PartitionFilters: [], PushedFilters: [],
      ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:bigint>

scala> sortedFlightData2015.take(2)
res6: Array[org.apache.spark.sql.Row] = Array([United States,Singapore,1], [Moldova,United States,1])

Page 37

Conceptual Plan

37

Lazy

Spark stores the plan in case it needs to recompute the result

Page 38

Schema

38

scala> val jsonSchema = spark.read.json(jsonFlightFile).schema
jsonSchema: org.apache.spark.sql.types.StructType =
  StructType(
    StructField(DEST_COUNTRY_NAME,StringType,true),
    StructField(ORIGIN_COUNTRY_NAME,StringType,true),
    StructField(count,LongType,true))

scala> val flightData2015 = spark.read.json(jsonFlightFile)
flightData2015: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

StructField
  name - the name of this field
  dataType - the data type of this field
  nullable - indicates if values of this field can be null
  metadata

Page 39

Schema

39

Spark was able to infer the schema since JSON objects have labels and typed values

{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":15}

Other data formats are less structured

Page 40

Reading CSV

40

+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

name,age
Andy,30
Justin,19
Michael,

people.csv

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

scala> val peopleFile = "/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv"

scala> val reader = spark.read
reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@288aaeaf

scala> reader.option("header", true)
scala> reader.option("inferSchema", true)

scala> val df = reader.csv(peopleFile)
df: org.apache.spark.sql.DataFrame = [name: string, age: int]

Page 41

Reading CSV

41

scala> df.show
+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

scala> df.schema
res10: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,true))

scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Page 42

Some CSV options

42

encoding, sep (separator), header, inferSchema, ignoreLeadingWhiteSpace, nullValue, dateFormat, timestampFormat

mode
  PERMISSIVE - keeps the row, putting the malformed record in a special corrupt-record field
  DROPMALFORMED - ignores whole corrupt records
  FAILFAST - throws an exception on a corrupt record
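
A sketch of chaining several of these options (the file path below is a placeholder); DROPMALFORMED silently skips rows that do not match the schema:

val df = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .option("sep", ",")
  .option("mode", "DROPMALFORMED")   // skip corrupt records instead of failing
  .csv("/path/to/people.csv")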

Page 43

43

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

object PeopleExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("Datasets Test")
    conf.setMaster("local[2]")
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    val peopleFile = "/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv"
    val reader = spark.read
    reader.option("header", true)
    reader.option("inferSchema", true)
    val df = reader.csv(peopleFile)
    df.show
    df.printSchema
    spark.stop
  }
}

Page 44

Type Inference Issues

44

val reader = spark.read

What type is reader?

scala> val reader: DataFrameReader = spark.read
<console>:23: error: not found: type DataFrameReader
       val reader: DataFrameReader = spark.read

scala> val reader: org.apache.spark.sql.DataFrameReader = spark.read
reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@3982832f

scala> import org.apache.spark.sql.DataFrameReader
import org.apache.spark.sql.DataFrameReader

scala> val reader: DataFrameReader = spark.read
reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@e6d4002

Page 45

We can select columns

45

name,age
Andy,30
Justin,19
Michael,

people.csv

scala> val names = df.select("name")
names: org.apache.spark.sql.DataFrame = [name: string]

scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+

Page 46

We can select columns

46

name,age
Andy,30
Justin,19
Michael,

people.csv

scala> import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.col

scala> val names = df.select(col("name"))
names: org.apache.spark.sql.DataFrame = [name: string]

scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+

Page 47

If you Don’t lk Abrvtns

47

name,age
Andy,30
Justin,19
Michael,

people.csv

scala> import org.apache.spark.sql.functions.column
import org.apache.spark.sql.functions.column

scala> val names = df.select(column("name"))
names: org.apache.spark.sql.DataFrame = [name: string]

scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+

Page 48

Column Operations

48

scala> val older = df.select(col("name"), col("age").plus(1))
older: org.apache.spark.sql.DataFrame = [name: string, (age + 1): int]

scala> older.show
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|   Andy|       31|
| Justin|       20|
|Michael|     null|
+-------+---------+

scala> older.printSchema
root
 |-- name: string (nullable = true)
 |-- (age + 1): integer (nullable = true)

scala> val older = df.select($"name", $"age" + 1)

Page 49

Column Operations - Java vs Scala

49

df.select(col("name"), col("age").plus(1))

df.select($"name", $"age" + 1)

Scala Only

Java or Scala

Page 50

50

scala> val adult = older.filter($"age" > 21)
scala> val adult = older.filter(col("age") > 21)
scala> val adult = older.filter(col("age").gt(21))
adult: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [name: string, (age + 1): int]

scala> adult.show
+----+---------+
|name|(age + 1)|
+----+---------+
|Andy|       31|
+----+---------+

scala> adult.explain
== Physical Plan ==
*Project [name#104, (age#105 + 1) AS (age + 1)#123]
+- *Filter (isnotnull(age#105) && (age#105 > 21))
   +- *FileScan csv [name#104,age#105]
      Batched: false, Format: CSV,
      Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv],
      PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,21)],
      ReadSchema: struct<name:string,age:int>

Page 51

51

scala> df.groupBy("age").count.show
+----+-----+
| age|count|
+----+-----+
|null|    1|
|  19|    1|
|  30|    1|
+----+-----+

Page 52

Saving DataFrames

52

{"name":"Andy","age":30} {"name":"Justin","age":19} {"name":"Michael"}

json, parquet, jdbc, orc, libsvm, csv, text

Formats

scala> df.write.format("json").save("people.json")

Produces a directory named people.json containing:

_SUCCESS (a 0-byte file)

part-00000-71516d50-2bcc-4830-ad61-554d1c107f51-c000.json
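
A sketch of controlling what happens when the output directory already exists (save otherwise fails); mode accepts overwrite, append, ignore, or errorIfExists:

df.write
  .format("json")
  .mode("overwrite")     // replace the people.json directory if it already exists
  .save("people.json")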

Page 53

Using SQL

53

scala> df.createOrReplaceTempView("people")

scala> val sqlExample = spark.sql("SELECT * FROM people")
sqlExample: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> sqlExample.show
+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

Page 54

show

54

show is an action with no return value; it only prints the values, so you cannot use the result

What happens on a cluster? Actions return their value to the master node, but jobs are often run in batch mode

Page 55

Collect - Returns a result

55

scala> val data = sqlExample.collect
data: Array[org.apache.spark.sql.Row] = Array([Andy,30], [Justin,19], [Michael,null])

scala> data(0)
res22: org.apache.spark.sql.Row = [Andy,30]

scala> data(0)(0)
res23: Any = Andy