CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2017
Doc 8 Spark Intro Modified Sep 26, 2017
Copyright ©, All rights reserved. 2017 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA. OpenContent (http://www.opencontent.org/opl.shtml) license defines the copyright on this document.
Spark
2
Created at UC Berkeley’s AMPLab
2009 - Project started
2014 May - Version 1.0
2016 July - Version 2.0.2
2017 July - Version 2.2.0
Programming interface for Java, Python, Scala, R
Interactive shell for Python, Scala, R (experimental)
Runs on Linux, Mac, Windows
Cluster managers: Native Spark cluster, Hadoop YARN, Apache Mesos
File systems: HDFS, MapR File System, Cassandra, OpenStack Swift, S3
Pseudo-Distributed Mode: single machine, uses local file system
Python vs Scala on Spark
3
Scala is faster than Python
But that is not so important here
Most of the computation is done inside the Spark engine itself
Using Python with Spark Python data has to be
Converted between Python format and Scala/Java format Sent between Python process and JVM
Time Line
4
1991 - Java project started
1995 - Java 1.0 released, Design Patterns book published
2000 - Java 1.3 (J2SE 1.3)
2001 - Scala project started
2002 - Nutch started
2004 - Google MapReduce paper
Scala version 1 released
2005 - F# released
2006 - Hadoop split from Nutch
Scala version 2 released
2007 - Clojure released
2009 - Spark project started
2012 - Hadoop 1.0
2014 - Spark 1.0
Major Parts of Spark
5
Spark Core Resilient Distributed Dataset (RDD)
Spark SQL SQL, csv, json Dataframe
Spark Streaming Near real-time response
MLlib Machine Learning Library Statistics, regression, clustering, dimension reduction, feature extraction Optimization
GraphX
Spark
6
Ecosystem of packages, libraries and systems on top of Spark Core
Unstructured API: Resilient Distributed Datasets (RDD), Accumulators, Broadcast variables
Structured API: DataFrames, Datasets, Spark SQL
The Structured API is newer, faster, and higher level; it is preferred over the Unstructured API
Basic Architecture
7
Local Mode
8
Diagram: a single Driver Process containing the SparkSession, the user code, and the worker threads
We will start using local mode
Use local mode to develop Spark code
SparkContext
9
Connection to the Spark cluster
Runs on the master node
Used to create RDDs, accumulators, broadcast variables
Only one SparkContext per JVM; stop() the current SparkContext before starting another
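A minimal sketch of the one-SparkContext-per-JVM rule; the app names and master setting are placeholders, not from the slides.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("First App").setMaster("local[*]"))
// ... create RDDs, accumulators, broadcast variables with sc ...
sc.stop()   // stop the current SparkContext before starting another
val sc2 = new SparkContext(new SparkConf().setAppName("Second App").setMaster("local[*]"))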
SparkContext org.apache.spark.SparkContext Scala version
JavaSparkContext org.apache.spark.api.java.JavaSparkContext Java version
Entry point for Unstructured API
SparkSession
10
Contains a SparkContext
Entry point to use Dataset & DataFrame
Connection to Spark cluster Runs on master node
org.apache.spark.sql.SparkSession
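A sketch of building a SparkSession and reaching the SparkContext it contains; the application name is a placeholder.

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Session Example")   // placeholder name
  .master("local[*]")
  .getOrCreate()

val sc = spark.sparkContext      // the SparkContext inside the session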
Major Data Structures
11
Resilient Distributed Datasets (RDDs) Fault-tolerant collection of elements that can be operated on in parallel
Dataset & Dataframes Fault-tolerant collection of elements that can be operated on in parallel Rows & Columns JSON, csv, SQL tables Part of SparkSQL
Partitions
12
RDD & Dataset
Divided into partitions
Each partition can be on a different machine
Resilient & Distributed
13
Distributed Partitions on different machines
Resilient Each partition can be replicated on multiple machines Data structure knows how to reproduce operations
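A small sketch of inspecting and changing the number of partitions, assuming the spark-shell's provided spark session; the partition count 8 is arbitrary.

val nums = spark.range(1000)                  // Dataset[Long]
println(nums.rdd.getNumPartitions)            // default depends on local cores
val eight = nums.repartition(8)               // redistribute into 8 partitions
println(eight.rdd.getNumPartitions)           // 8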
Basic Operations
14
RDDs, Dataframes, Datasets Immutable
Transformations Create new dataset (RDD) from existing one Lazy Only done when needed by an action Examples
map, filter, sample, union, distinct, groupByKey, repartition
Actions Return results to driver program Examples
reduce, collect, count, first, take
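A sketch of laziness with an RDD, using the spark-shell's provided sc; the file path is a placeholder.

val lines      = sc.textFile("README.md")                        // transformation - nothing read yet
val sparkLines = lines.filter(line => line.contains("Spark"))    // still lazy
val n          = sparkLines.count()                              // action - now the file is read and filtered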
Actions & Transformations on DataSet
15
org.apache.spark.sql.Dataset
View the Spark Scala API
Starting Spark Scala REPL
16
./bin/spark-shell
From Spark installation
scala> sc res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@508abc74
scala> spark res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1c618295
Provided variables - Spark Scala REPL only
Starting spark shell
17
Al pro 9->spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/09/24 19:57:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/24 19:58:03 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.1.9:4040
Spark context available as 'sc' (master = local[*], app id = local-1506308274361).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.
Sample Interaction
18
where - Transformation count - Action
scala> val range = spark.range(100) range: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> val rangeWithLabel = range.toDF("number") rangeWithLabel: org.apache.spark.sql.DataFrame = [number: bigint]
scala> val divisibleBy2 = rangeWithLabel.where("number % 2 = 0") divisibleBy2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [number: bigint]
scala> divisibleBy2.count() res2: Long = 50
Sample Interaction
19
scala> val filePath = "/Users/whitney/test/README.md" filePath: String = /Users/whitney/test/README.md
scala> val textFile = spark.read.textFile(filePath) textFile: org.apache.spark.sql.Dataset[String] = [value: string]
scala> textFile.count() res3: Long = 103
scala> textFile.first() res4: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark")) linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]
scala> linesWithSpark.count() res5: Long = 20
spark.read returns DataFrameReader
20
Can read csv, jdbc, json, ORC, Parquet, text
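A sketch of a few of these readers; the file paths are placeholders.

val flights = spark.read.json("flight-data/2015-summary.json")
val people  = spark.read.option("header", true).csv("people.csv")
val readme  = spark.read.text("README.md")
val parquet = spark.read.parquet("flight-data/2015-summary.parquet")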
Setting number of Workers
21
./bin/spark-shell --master local[4]
Application using SBT
22
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.11"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
build.sbt
Sample Program Using RDDs
23
import org.apache.spark.{SparkConf, SparkContext}
object MasterConnect {
  def main(args: Array[String]): Unit = {
    val filePath = "/Users/whitney/test/README.md"
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val minPartitions = 2
    val logData = sc.textFile(filePath, minPartitions)
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
Sample Program Using Spark SQL
24
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

object SparkTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("Word Count")
    conf.setMaster("local[2]")
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    val filePath = "/Users/whitney/test/README.md"
    val textFile = spark.read.textFile(filePath)
    val linesWithSpark = textFile.filter(line => line.contains("Spark"))
    val sparkCount = linesWithSpark.count()
    println(sparkCount)
  }
}
Output
25
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/09/24 19:53:17 INFO SparkContext: Running Spark version 2.2.0
17/09/24 19:53:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/24 19:53:19 INFO SparkContext: Submitted application: Datasets Test
17/09/24 19:53:19 INFO SecurityManager: Changing view acls to: whitney
17/09/24 19:53:19 INFO SecurityManager: Changing modify acls to: whitney
17/09/24 19:53:19 INFO SecurityManager: Changing view acls groups to:
17/09/24 19:53:19 INFO SecurityManager: Changing modify acls groups to:
17/09/24 19:53:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(whitney); groups with view permissions: Set(); users with modify permissions: Set(whitney); groups with modify permissions: Set()
17/09/24 19:53:20 INFO Utils: Successfully started service 'sparkDriver' on port 61753.
17/09/24 19:53:20 INFO SparkEnv: Registering MapOutputTracker
17/09/24 19:53:20 INFO SparkEnv: Registering BlockManagerMaster
17/09/24 19:53:20 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/09/24 19:53:20 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/09/24 19:53:20 INFO DiskBlockManager: Created local directory at /private/var/folders/br/q_fcsjqc8xj9qn0059bctj3h0000gr/T/blockmgr-45dac785-124b-4203-a39e-895866d2cca3
17/09/24 19:53:20 INFO MemoryStore: MemoryStore started with capacity 912.3 MB
17/09/24 19:53:20 INFO SparkEnv: Registering OutputCommitCoordinator
17/09/24 19:53:20 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/09/24 19:53:21 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.9:4040
17/09/24 19:53:21 INFO Executor: Starting executor ID driver on host localhost
17/09/24 19:53:21 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 61754.
DataFrame, DataSet & RDD
26
What are they
What is the difference
When to use which one
Which languages can use them
DataFrame
27
+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+
Table with rows and Columns
Schema Column labels Column types
Row org.apache.spark.sql.Row
Partitioner Distributes DataFrame among cluster
Plan Series of transformations to perform on DataFrame
Languages Scala, Java, JVM languages, Python, R
Optimized Spark Catalyst Optimizer
DataSet
28
Same as DataFrame except for the rows
Programmer defines the row class
Scala: case class
Java: Java Bean
Difference from DataFrame Compiler knows column names and column types in DataSet
Compile time error checking
Better data layout
Languages Scala, Java, JVM languages
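A sketch of a Dataset row type defined as a Scala case class; the Person class and values are made up for illustration, and spark.implicits._ supplies the encoder.

case class Person(name: String, age: Long)

import spark.implicits._                                  // provides the Person encoder

val people = Seq(Person("Andy", 30), Person("Justin", 19)).toDS()
val adults = people.filter(p => p.age >= 21)              // field names and types checked at compile time
// a DataFrame whose columns match the case class can also be converted: df.as[Person]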
RDD
29
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+
Table No information about types
No compile time or runtime type checking
Far fewer optimizations No Catalyst Optimizer No space optimization
Example - Same data RDD 33.3 MB DataFrame 7.3 MB
Languages Java, Scala Python, R - not recommended
Shares same basic operations as DataFrames & DataSets
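A sketch of moving between the structured and unstructured structures; it assumes an existing DataFrame df whose first column is a string, plus the spark.implicits._ import.

import spark.implicits._

val rowRDD  = df.rdd                               // DataFrame -> RDD[Row]
val nameRDD = rowRDD.map(row => row.getString(0))  // plain RDD operations
val namesDF = nameRDD.toDF("name")                 // RDD -> DataFrame again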
Spark Types
30
Java Types are not space efficient "abcd" - 48 bytes
Spark has its own types
Special memory representation of each type Space efficient Cache aware
Spark         Scala   Python        Python API
ByteType      Byte    int or long   ByteType()
ShortType     Short   int or long   ShortType()
IntegerType   Int     int or long   IntegerType()
LongType      Long    int or long   LongType()
Structured versus Unstructured
31
Structured = DataSet, DataFrame
Unstructured = RDD
Typed versus Untyped
32
Typed = DataSet
Untyped = DataFrame
Some Sample Data
33
{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":15} {"ORIGIN_COUNTRY_NAME":"Croatia","DEST_COUNTRY_NAME":"United States","count":1} {"ORIGIN_COUNTRY_NAME":"Ireland","DEST_COUNTRY_NAME":"United States","count":344} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Egypt","count":15} {"ORIGIN_COUNTRY_NAME":"India","DEST_COUNTRY_NAME":"United States","count":62} {"ORIGIN_COUNTRY_NAME":"Singapore","DEST_COUNTRY_NAME":"United States","count":1} {"ORIGIN_COUNTRY_NAME":"Grenada","DEST_COUNTRY_NAME":"United States","count":62} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Costa Rica","count":588} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Senegal","count":40} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Moldova","count":1}
JSON flight Data 2015
United States Bureau of Transportation statistics
Spark: The Definitive Guide, Zaharia & Chambers, O’Reilly Media, Inc, 2017-10-??
2015-summary.json
34
scala> val jsonFlightFile = "/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summary.json"
jsonFlightFile: String = /Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summary.json
scala> val flightData2015 = spark.read.json(jsonFlightFile) flightData2015: org.apache.spark.sql.DataFrame =
[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
scala> flightData2015.take(2) res3: Array[org.apache.spark.sql.Row] =
Array([United States,Romania,15], [United States,Croatia,1])
Explain - Spark Plan
35
scala> flightData2015.explain()
== Physical Plan ==
*FileScan json [DEST_COUNTRY_NAME#44,ORIGIN_COUNTRY_NAME#45,count#46L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summ..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:bigint>
36
scala> val sortedFlightData2015 = flightData2015.sort("count") sortedFlightData2015: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
scala> sortedFlightData2015.explain()
== Physical Plan ==
*Sort [count#46L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#46L ASC NULLS FIRST, 200)
   +- *FileScan json [DEST_COUNTRY_NAME#44,ORIGIN_COUNTRY_NAME#45,count#46L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summ..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:bigint>
scala> sortedFlightData2015.take(2) res6: Array[org.apache.spark.sql.Row] =
Array([United States,Singapore,1], [Moldova,United States,1])
Conceptual Plan
37
Lazy
Spark stores the plan in case it needs to recompute the result
Schema
38
scala> val jsonSchema = spark.read.json(jsonFlightFile).schema jsonSchema: org.apache.spark.sql.types.StructType =
StructType( StructField(DEST_COUNTRY_NAME,StringType,true), StructField(ORIGIN_COUNTRY_NAME,StringType,true), StructField(count,LongType,true))
scala> val flightData2015 = spark.read.json(jsonFlightFile) flightData2015: org.apache.spark.sql.DataFrame =
[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
StructField
  name - The name of this field.
  dataType - The data type of this field.
  nullable - Indicates if values of this field can be null values.
  metadata
Schema
39
Spark was able to infer the schema since each JSON object has labels and typed values
{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":15}
Other data formats are less structured
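For less structured formats you can supply the schema yourself instead of inferring it. A minimal sketch using StructType and StructField, reusing the jsonFlightFile path from the earlier slide.

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

val flightSchema = StructType(Seq(
  StructField("DEST_COUNTRY_NAME",   StringType, nullable = true),
  StructField("ORIGIN_COUNTRY_NAME", StringType, nullable = true),
  StructField("count",               LongType,   nullable = true)))

val flightData = spark.read.schema(flightSchema).json(jsonFlightFile)   // no inference pass needed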
Reading CSV
40
+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+
name,age
Andy,30
Justin,19
Michael,
people.csv
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
scala> val peopleFile = "/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv"
scala> val reader = spark.read reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@288aaeaf
scala> reader.option("header", true)
scala> reader.option("inferSchema", true)
scala> val df = reader.csv(peopleFile) df: org.apache.spark.sql.DataFrame = [name: string, age: int]
Reading CSV
41
scala> df.show
+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+
scala> df.schema res10: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,true))
scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
Some CSV options
42
encoding sep (erator) header inferSchema ignoreLeadingWhiteSpace nullValue dateFormat timeStampFormat
mode
  PERMISSIVE - sets record field on corrupt record
  DROPMALFORMED - ignores whole corrupt records
  FAILFAST - throw exception on corrupt record
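A sketch combining several of these options in one reader; the option values and file path are illustrative only.

val df = spark.read
  .option("header", true)            // first line holds column names
  .option("inferSchema", true)       // guess column types from the data
  .option("sep", ",")
  .option("nullValue", "NA")         // treat "NA" as null
  .option("mode", "DROPMALFORMED")   // silently drop corrupt records
  .csv("people.csv")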
43
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

object PeopleExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("Datasets Test")
    conf.setMaster("local[2]")
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    val peopleFile = "/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv"
    val reader = spark.read
    reader.option("header", true)
    reader.option("inferSchema", true)
    val df = reader.csv(peopleFile)
    df.show
    df.printSchema
    spark.stop
  }
}
Type Inference Issues
44
val reader = spark.read What type is reader?
scala> val reader: DataFrameReader = spark.read <console>:23: error: not found: type DataFrameReader val reader: DataFrameReader = spark.read
scala> val reader: org.apache.spark.sql.DataFrameReader = spark.read reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@3982832f
scala> import org.apache.spark.sql.DataFrameReader import org.apache.spark.sql.DataFrameReader
scala> val reader: DataFrameReader = spark.read reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@e6d4002
We can select columns
45
name,age
Andy,30
Justin,19
Michael,
people.csv
scala> val names = df.select("name") names: org.apache.spark.sql.DataFrame = [name: string]
scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+
We can select columns
46
name,age
Andy,30
Justin,19
Michael,
people.csv
scala> import org.apache.spark.sql.functions.col import org.apache.spark.sql.functions.col
scala> val names = df.select(col("name")) names: org.apache.spark.sql.DataFrame = [name: string]
scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+
If you Don’t lk Abrvtns
47
name,age
Andy,30
Justin,19
Michael,
people.csv
scala> import org.apache.spark.sql.functions.column import org.apache.spark.sql.functions.column
scala> val names = df.select(column("name")) names: org.apache.spark.sql.DataFrame = [name: string]
scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+
Column Operations
48
scala> val older = df.select(col("name"), col("age").plus(1)) older: org.apache.spark.sql.DataFrame = [name: string, (age + 1): int]
scala> older.show
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|   Andy|       31|
| Justin|       20|
|Michael|     null|
+-------+---------+
scala> older.printSchema
root
 |-- name: string (nullable = true)
 |-- (age + 1): integer (nullable = true)
scala> val older = df.select($"name", $"age" + 1)
Column Operations - Java vs Scala
49
df.select(col("name"), col("age").plus(1))
df.select($"name", $"age" + 1)
Scala Only
Java or Scala
50
scala> val adult = older.filter($"age" > 21) scala> val adult = older.filter(col("age") > 21) scala> val adult = older.filter(col("age").gt(21)) adult: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [name: string, (age + 1): int]
scala> adult.show
+----+---------+
|name|(age + 1)|
+----+---------+
|Andy|       31|
+----+---------+
scala> adult.explain
== Physical Plan ==
*Project [name#104, (age#105 + 1) AS (age + 1)#123]
+- *Filter (isnotnull(age#105) && (age#105 > 21))
   +- *FileScan csv [name#104,age#105] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv], PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,21)], ReadSchema: struct<name:string,age:int>
51
scala> df.groupBy("age").count.show
+----+-----+
| age|count|
+----+-----+
|null|    1|
|  19|    1|
|  30|    1|
+----+-----+
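groupBy can also feed richer aggregates from org.apache.spark.sql.functions. A sketch, reusing the flightData2015 DataFrame from the earlier slides; the "total" column name is made up.

import org.apache.spark.sql.functions.{sum, desc}

val topDestinations = flightData2015
  .groupBy("DEST_COUNTRY_NAME")
  .agg(sum("count").as("total"))     // total flights per destination
  .orderBy(desc("total"))
topDestinations.show(5)              // the five busiest destination countries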
Saving DataFrames
52
{"name":"Andy","age":30} {"name":"Justin","age":19} {"name":"Michael"}
json, parquet, jdbc, orc, libsvm, csv, text
Formats
scala> df.write.format("json").save("people.json")
Produces a directory: people.json Contents:
_SUCCESS 0 Byte file
part-00000-71516d50-2bcc-4830-ad61-554d1c107f51-c000.json
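Other formats are written the same way; a sketch where the output paths are placeholders and mode("overwrite") controls what happens if the directory already exists.

df.write.mode("overwrite").option("header", true).csv("people_csv")
df.coalesce(1).write.mode("overwrite").json("people_single")   // coalesce(1) gives a single part file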
Using SQL
53
scala> df.createOrReplaceTempView("people")
scala> val sqlExample = spark.sql("SELECT * FROM people") sqlExample: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> sqlExample.show
+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+
show
54
action
No return value
Only prints out the values
So you cannot use the result
What happens on a cluster? Actions return their value to the master node, but jobs often run in batch mode
Collect - Returns a result
55
scala> val data = sqlExample.collect data: Array[org.apache.spark.sql.Row] = Array([Andy,30], [Justin,19], [Michael,null])
scala> data(0) res22: org.apache.spark.sql.Row = [Andy,30]
scala> data(0)(0) res23: Any = Andy
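Indexing a Row returns Any; the typed accessors avoid the cast. A small sketch using the data array collected above.

val first = data(0)                    // org.apache.spark.sql.Row
val name  = first.getString(0)         // typed access by position
val age   = first.getAs[Int]("age")    // typed access by column name (throws if the value is null)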