
Page 1

CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2017

Doc 8 Spark Intro Modified Sep 26, 2017

Copyright ©, All rights reserved. 2017 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA. OpenContent (http://www.opencontent.org/opl.shtml) license defines the copyright on this document.

Page 2

Spark

2

Created at UC Berkeley’s AMPLab

2009 - Project started
2014 May - Version 1.0
2016 July - Version 2.0.2
2017 July - Version 2.2.0

Programming interface for Java, Python, Scala, R

Interactive shell for Python, Scala, R (experimental)

Runs on Linux, Mac, Windows

Cluster managers: native Spark cluster, Hadoop YARN, Apache Mesos

File systems: HDFS, MapR File System, Cassandra, OpenStack Swift, S3

Pseudo-distributed mode: single machine, uses the local file system

Page 3

Python vs Scala on Spark

3

Scala is faster than Python, but that is not so important here: most of the computation is done inside Spark itself.

Using Python with Spark, Python data has to be

converted between Python format and Scala/Java format
sent between the Python process and the JVM

Page 4

Time Line

4

1991 - Java project started

1995 - Java 1.0 released, Design Patterns book published

2000 - Java 1.3 (J2SE 1.3)

2001 - Scala project started

2002 - Nutch started

2004 - Google MapReduce paper

Scala version 1 released

2005 - F# released

2006 - Hadoop split from Nutch

Scala version 2 released

2007 - Clojure released

2009 - Spark project started

2012 - Hadoop 1.0

2014 - Spark 1.0

Page 5

Major Parts of Spark

5

Spark Core: Resilient Distributed Dataset (RDD)

Spark SQL: SQL, CSV, JSON, DataFrames

Spark Streaming: near real-time response

MLlib machine learning library: statistics, regression, clustering, dimension reduction, feature extraction, optimization

GraphX

Page 6

Spark

6

Ecosystem of packages, libraries and systems on top of Spark Core

Unstructured API: Resilient Distributed Datasets (RDDs), accumulators, broadcast variables

Structured API: DataFrames, Datasets, Spark SQL
Newer, faster, higher level; preferred over the Unstructured API

Page 7

Basic Architecture

7

Page 8

Local Mode

8

[Diagram: a driver process containing the SparkSession, the user code, and the worker threads]

We will start by using local mode

Use local mode to develop Spark code

Page 9

SparkContext

9

Connection to a Spark cluster; runs on the master node
Used to create RDDs, accumulators, and broadcast variables
Only one SparkContext per JVM; stop() the current SparkContext before starting another

SparkContext (org.apache.spark.SparkContext) - Scala version

JavaSparkContext (org.apache.spark.api.java.JavaSparkContext) - Java version

Entry point for the Unstructured API

Page 10

SparkSession

10

Contains a SparkContext

Entry point to use Dataset & DataFrame

Connection to a Spark cluster; runs on the master node

org.apache.spark.sql.SparkSession
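
A minimal sketch of creating a SparkSession in a standalone program (the application name below is a placeholder); the spark-shell already provides one as the variable spark:

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession; local[*] runs Spark locally on all cores
val spark = SparkSession.builder()
  .appName("SparkSession Example")   // placeholder application name
  .master("local[*]")
  .getOrCreate()

// The SparkContext it contains is the entry point for the Unstructured API
val sc = spark.sparkContext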

Page 11

Major Data Structures

11

Resilient Distributed Datasets (RDDs): fault-tolerant collection of elements that can be operated on in parallel

Datasets & DataFrames: fault-tolerant collection of elements that can be operated on in parallel; rows & columns; JSON, CSV, SQL tables; part of Spark SQL

Page 12

Partitions

12

RDD & Dataset Divided into partitions Each partition is on different machine

Page 13

Resilient & Distributed

13

Distributed: partitions are on different machines

Resilient: each partition can be replicated on multiple machines; the data structure knows how to reproduce its operations

Page 14

Basic Operations

14

RDDs, DataFrames, and Datasets are immutable

Transformations: create a new dataset (RDD) from an existing one; lazy - only computed when needed by an action. Examples:

map, filter, sample, union, distinct, groupByKey, repartition

Actions: return results to the driver program. Examples:

reduce, collect, count, first, take
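
A short spark-shell sketch of this laziness: the where transformation only records a step in the plan, and nothing runs until the count action is called:

val numbers = spark.range(1000).toDF("n")   // nothing computed yet
val evens = numbers.where("n % 2 = 0")      // transformation, lazily added to the plan
val howMany = evens.count()                 // action: triggers the computation, returns 500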

Page 15

Actions & Transformations on DataSet

15

org.apache.spark.sql.Dataset

View the Spark Scala API

Page 16

Starting Spark Scala REPL

16

./bin/spark-shell

From Spark installation

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@508abc74

scala> spark
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1c618295

Provided variables - Spark Scala REPL only

Page 17

Starting spark shell

17

Al pro 9->spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/09/24 19:57:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/24 19:58:03 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.1.9:4040
Spark context available as 'sc' (master = local[*], app id = local-1506308274361).
Spark session available as 'spark'.
Welcome to Spark version 2.2.0
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

Page 18

Sample Interaction

18

where - transformation
count - action

scala> val range = spark.range(100)
range: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val rangeWithLabel = range.toDF("number")
rangeWithLabel: org.apache.spark.sql.DataFrame = [number: bigint]

scala> val divisibleBy2 = rangeWithLabel.where("number % 2 = 0")
divisibleBy2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [number: bigint]

scala> divisibleBy2.count()
res2: Long = 50

Page 19

Sample Interaction

19

scala> val filePath = "/Users/whitney/test/README.md"
filePath: String = /Users/whitney/test/README.md

scala> val textFile = spark.read.textFile(filePath)
textFile: org.apache.spark.sql.Dataset[String] = [value: string]

scala> textFile.count()
res3: Long = 103

scala> textFile.first()
res4: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]

scala> linesWithSpark.count()
res5: Long = 20

Page 20

spark.read returns DataFrameReader

20

Can read: csv, jdbc, json, ORC, Parquet, text

Page 21

Setting number of Workers

21

./bin/spark-shell --master local[4]

Page 22

Application using SBT

22

name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.11"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"

build.sbt
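
A hedged sketch of building and running such a project with sbt and spark-submit; the jar path assumes sbt's default naming for the name and version above, and the main class name is a placeholder:

sbt package
./bin/spark-submit \
  --class "SparkTest" \
  --master "local[4]" \
  target/scala-2.11/simple-project_2.11-1.0.jar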

Page 23

Sample Program Using RDDs

23

import org.apache.spark.{SparkConf, SparkContext}

object MasterConnect {
  def main(args: Array[String]): Unit = {
    val filePath = "/Users/whitney/test/README.md"
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val minPartitions = 2
    val logData = sc.textFile(filePath, minPartitions)
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

Page 24

Sample Program Using Spark SQL

24

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

object SparkTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("Word Count")
    conf.setMaster("local[2]")
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    val filePath = "/Users/whitney/test/README.md"
    val textFile = spark.read.textFile(filePath)
    val linesWithSpark = textFile.filter(line => line.contains("Spark"))
    val sparkCount = linesWithSpark.count()
    println(sparkCount)
  }
}

Page 25

Output

25

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/09/24 19:53:17 INFO SparkContext: Running Spark version 2.2.0
17/09/24 19:53:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/24 19:53:19 INFO SparkContext: Submitted application: Datasets Test
17/09/24 19:53:19 INFO SecurityManager: Changing view acls to: whitney
17/09/24 19:53:19 INFO SecurityManager: Changing modify acls to: whitney
17/09/24 19:53:19 INFO SecurityManager: Changing view acls groups to:
17/09/24 19:53:19 INFO SecurityManager: Changing modify acls groups to:
17/09/24 19:53:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(whitney); groups with view permissions: Set(); users with modify permissions: Set(whitney); groups with modify permissions: Set()
17/09/24 19:53:20 INFO Utils: Successfully started service 'sparkDriver' on port 61753.
17/09/24 19:53:20 INFO SparkEnv: Registering MapOutputTracker
17/09/24 19:53:20 INFO SparkEnv: Registering BlockManagerMaster
17/09/24 19:53:20 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/09/24 19:53:20 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/09/24 19:53:20 INFO DiskBlockManager: Created local directory at /private/var/folders/br/q_fcsjqc8xj9qn0059bctj3h0000gr/T/blockmgr-45dac785-124b-4203-a39e-895866d2cca3
17/09/24 19:53:20 INFO MemoryStore: MemoryStore started with capacity 912.3 MB
17/09/24 19:53:20 INFO SparkEnv: Registering OutputCommitCoordinator
17/09/24 19:53:20 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/09/24 19:53:21 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.9:4040
17/09/24 19:53:21 INFO Executor: Starting executor ID driver on host localhost
17/09/24 19:53:21 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 61754.

Page 26

DataFrame, DataSet & RDD

26

What are they

What is the difference

When to use which one

Which languages can use them

Page 27

DataFrame

27

+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

Table with rows and columns

Schema: column labels and column types

Row: org.apache.spark.sql.Row

Partitioner: distributes the DataFrame across the cluster

Plan: series of transformations to perform on the DataFrame

Languages: Scala, Java, JVM languages, Python, R

Optimized by the Spark Catalyst Optimizer

Page 28

DataSet

28

Same as DataFrame except for Rows

Programmer defines the row class: Scala case class or Java Bean

Difference from DataFrame: the compiler knows the column names and column types in a Dataset

Compile-time error checking

Better data layout

Languages: Scala, Java, JVM languages
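
A minimal spark-shell sketch of a Dataset with a programmer-defined row class; the Person case class and sample rows are made up for illustration:

case class Person(name: String, age: Long)       // programmer-defined row type
import spark.implicits._                          // encoders for case classes

val peopleDS = Seq(Person("Andy", 30), Person("Justin", 19)).toDS()
peopleDS.filter(p => p.age > 21).show()           // field names and types checked at compile time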

Page 29

RDD

29

+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

Table with no information about types

No compile-time or runtime type checking

Far fewer optimizations: no Catalyst Optimizer, no space optimization

Example - the same data: 33.3 MB as an RDD, 7.3 MB as a DataFrame

Languages: Java, Scala (Python, R - not recommended)

Shares same basic operations as DataFrames & DataSets

Page 30

Spark Types

30

Java types are not space efficient: the String "abcd" takes 48 bytes

Spark has its own types

Special memory representation of each type Space efficient Cache aware

Spark type     Scala type   Python type    Python API
ByteType       Byte         int or long    ByteType()
ShortType      Short        int or long    ShortType()
IntegerType    Int          int or long    IntegerType()
LongType       Long         int or long    LongType()
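
A small sketch of using these Spark types to declare a schema by hand (the column names are made up here); StructType and StructField appear again on the Schema slide below:

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// an explicit schema built from Spark's own types rather than inferred
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("count", LongType, nullable = true)))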

Page 31

Structured versus Unstructured

31

Structured = DataSet, DataFrame

Unstructured = RDD

Page 32

Typed versus Untyped

32

Typed = DataSet

Untyped = DataFrame

Page 33

Some Sample Data

33

{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":15} {"ORIGIN_COUNTRY_NAME":"Croatia","DEST_COUNTRY_NAME":"United States","count":1} {"ORIGIN_COUNTRY_NAME":"Ireland","DEST_COUNTRY_NAME":"United States","count":344} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Egypt","count":15} {"ORIGIN_COUNTRY_NAME":"India","DEST_COUNTRY_NAME":"United States","count":62} {"ORIGIN_COUNTRY_NAME":"Singapore","DEST_COUNTRY_NAME":"United States","count":1} {"ORIGIN_COUNTRY_NAME":"Grenada","DEST_COUNTRY_NAME":"United States","count":62} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Costa Rica","count":588} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Senegal","count":40} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Moldova","count":1}

JSON flight Data 2015

United States Bureau of Transportation statistics

Spark: The Definitive Guide, Zaharia & Chambers, O'Reilly Media, Inc., 2017-10-??

2015-summary.json

Page 34

34

scala> val jsonFlightFile = "/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summary.json"
jsonFlightFile: String = /Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summary.json

scala> val flightData2015 = spark.read.json(jsonFlightFile)
flightData2015: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> flightData2015.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([United States,Romania,15], [United States,Croatia,1])

Page 35

Explain - Spark Plan

35

scala> flightData2015.explain()
== Physical Plan ==
*FileScan json [DEST_COUNTRY_NAME#44,ORIGIN_COUNTRY_NAME#45,count#46L]
   Batched: false, Format: JSON,
   Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summ...,
   PartitionFilters: [], PushedFilters: [],
   ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:bigint>

Page 36

36

scala> val sortedFlightData2015 = flightData2015.sort("count")
sortedFlightData2015: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> sortedFlightData2015.explain()
== Physical Plan ==
*Sort [count#46L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#46L ASC NULLS FIRST, 200)
   +- *FileScan json [DEST_COUNTRY_NAME#44,ORIGIN_COUNTRY_NAME#45,count#46L]
      Batched: false, Format: JSON,
      Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summ...,
      PartitionFilters: [], PushedFilters: [],
      ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:bigint>

scala> sortedFlightData2015.take(2)
res6: Array[org.apache.spark.sql.Row] = Array([United States,Singapore,1], [Moldova,United States,1])

Page 37

Conceptual Plan

37

Lazy

Spark stores the plan in case it needs to recompute the result

Page 38

Schema

38

scala> val jsonSchema = spark.read.json(jsonFlightFile).schema
jsonSchema: org.apache.spark.sql.types.StructType =
  StructType(
    StructField(DEST_COUNTRY_NAME,StringType,true),
    StructField(ORIGIN_COUNTRY_NAME,StringType,true),
    StructField(count,LongType,true))

scala> val flightData2015 = spark.read.json(jsonFlightFile)
flightData2015: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

StructField
  name - the name of this field
  dataType - the data type of this field
  nullable - indicates if values of this field can be null
  metadata

Page 39

Schema

39

Spark was able to infer the schema since JSON objects have labels and typed values

{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":15}

Other data formats are less structured

Page 40

Reading CSV

40

+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

name,age
Andy,30
Justin,19
Michael,

people.csv

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

scala> val peopleFile = "/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv"

scala> val reader = spark.read
reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@288aaeaf

scala> reader.option("header", true)
scala> reader.option("inferSchema", true)

scala> val df = reader.csv(peopleFile)
df: org.apache.spark.sql.DataFrame = [name: string, age: int]

Page 41

Reading CSV

41

scala> df.show
+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

scala> df.schema
res10: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,true))

scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Page 42

Some CSV options

42

encoding, sep (separator), header, inferSchema, ignoreLeadingWhiteSpace, nullValue, dateFormat, timestampFormat

mode
  PERMISSIVE - keeps the row, putting the malformed record in a special corrupt-record field
  DROPMALFORMED - ignores whole corrupt records
  FAILFAST - throws an exception on a corrupt record
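
A sketch of chaining several of these options (the file path below is a placeholder); DROPMALFORMED silently skips rows that do not match the schema:

val df = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .option("sep", ",")
  .option("mode", "DROPMALFORMED")   // skip corrupt records instead of failing
  .csv("/path/to/people.csv")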

Page 43

43

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

object PeopleExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("Datasets Test")
    conf.setMaster("local[2]")
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    val peopleFile = "/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv"
    val reader = spark.read
    reader.option("header", true)
    reader.option("inferSchema", true)
    val df = reader.csv(peopleFile)
    df.show
    df.printSchema
    spark.stop
  }
}

Page 44

Type Inference Issues

44

val reader = spark.read

What type is reader?

scala> val reader: DataFrameReader = spark.read
<console>:23: error: not found: type DataFrameReader
       val reader: DataFrameReader = spark.read

scala> val reader: org.apache.spark.sql.DataFrameReader = spark.read
reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@3982832f

scala> import org.apache.spark.sql.DataFrameReader
import org.apache.spark.sql.DataFrameReader

scala> val reader: DataFrameReader = spark.read
reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@e6d4002

Page 45

We can select columns

45

name,age
Andy,30
Justin,19
Michael,

people.csv

scala> val names = df.select("name")
names: org.apache.spark.sql.DataFrame = [name: string]

scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+

Page 46

We can select columns

46

name,age
Andy,30
Justin,19
Michael,

people.csv

scala> import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.col

scala> val names = df.select(col("name"))
names: org.apache.spark.sql.DataFrame = [name: string]

scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+

Page 47

If you Don’t lk Abrvtns

47

name,age
Andy,30
Justin,19
Michael,

people.csv

scala> import org.apache.spark.sql.functions.column
import org.apache.spark.sql.functions.column

scala> val names = df.select(column("name"))
names: org.apache.spark.sql.DataFrame = [name: string]

scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+

Page 48

Column Operations

48

scala> val older = df.select(col("name"), col("age").plus(1))
older: org.apache.spark.sql.DataFrame = [name: string, (age + 1): int]

scala> older.show
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|   Andy|       31|
| Justin|       20|
|Michael|     null|
+-------+---------+

scala> older.printSchema
root
 |-- name: string (nullable = true)
 |-- (age + 1): integer (nullable = true)

scala> val older = df.select($"name", $"age" + 1)

Page 49

Column Operations - Java vs Scala

49

df.select(col("name"), col("age").plus(1))

df.select($"name", $"age" + 1)

Scala Only

Java or Scala

Page 50

50

scala> val adult = older.filter($"age" > 21)
scala> val adult = older.filter(col("age") > 21)
scala> val adult = older.filter(col("age").gt(21))
adult: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [name: string, (age + 1): int]

scala> adult.show
+----+---------+
|name|(age + 1)|
+----+---------+
|Andy|       31|
+----+---------+

scala> adult.explain
== Physical Plan ==
*Project [name#104, (age#105 + 1) AS (age + 1)#123]
+- *Filter (isnotnull(age#105) && (age#105 > 21))
   +- *FileScan csv [name#104,age#105]
      Batched: false, Format: CSV,
      Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv],
      PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,21)],
      ReadSchema: struct<name:string,age:int>

Page 51

51

scala> df.groupBy("age").count.show
+----+-----+
| age|count|
+----+-----+
|null|    1|
|  19|    1|
|  30|    1|
+----+-----+

Page 52

Saving DataFrames

52

{"name":"Andy","age":30} {"name":"Justin","age":19} {"name":"Michael"}

json, parquet, jdbc, orc, libsvm, csv, text

Formats

scala> df.write.format("json").save("people.json")

Produces a directory named people.json containing:

_SUCCESS (a 0-byte file)

part-00000-71516d50-2bcc-4830-ad61-554d1c107f51-c000.json
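
A sketch of controlling what happens when the output directory already exists (save otherwise fails); mode accepts overwrite, append, ignore, or errorIfExists:

df.write
  .format("json")
  .mode("overwrite")     // replace the people.json directory if it already exists
  .save("people.json")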

Page 53

Using SQL

53

scala> df.createOrReplaceTempView("people")

scala> val sqlExample = spark.sql("SELECT * FROM people")
sqlExample: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> sqlExample.show
+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

Page 54

show

54

show is an action with no return value; it only prints the values, so you cannot use the result

What happens on a cluster? Actions return their value to the master node, but jobs are often run in batch mode

Page 55

Collect - Returns a result

55

scala> val data = sqlExample.collect
data: Array[org.apache.spark.sql.Row] = Array([Andy,30], [Justin,19], [Michael,null])

scala> data(0)
res22: org.apache.spark.sql.Row = [Andy,30]

scala> data(0)(0)
res23: Any = Andy