Copyright2016 NTT corp. All Rights Reserved.
Spark Summit 2016 @ SF / Spark
2016/7/15
Who am I?
one-size-fits-many: applications (SQL, Streaming, MLlib, ...)
Spark
Spark API
Spark RDD - Resilient Distributed Dataset
val data = Array(1, 2, 3, 4, 5)      // a local Scala array
val dataRdd = sc.parallelize(data)    // turn it into an RDD
val result = dataRdd.map(_ + 1)       // operations on the RDD
  .filter(_ % 2 == 0)
  .reduce(_ + _)
The Driver Program splits a Spark job into Tasks and sends them to the Executors on the Worker Nodes.
Spark
Source: Cluster Overview, http://spark.apache.org/docs/1.6.2/cluster-overview.html
1. Download the Spark v1.6.2 binary: http://spark.apache.org/downloads.html
2. Launch a Spark shell: ./bin/spark-shell
3. Count words:
scala> val textFile = sc.textFile("hoge.txt")
scala> val counts = textFile.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
Quick Start!
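For readers without a Spark shell at hand, the flatMap / map / reduceByKey pipeline above can be approximated on plain Scala collections. This is only a sketch of what the word count computes, not Spark code: `LocalWordCount` and `wordCount` are hypothetical names, and `groupBy` followed by a per-key sum stands in for `reduceByKey`.

```scala
// A local, single-machine sketch of the word count above (no Spark needed).
// LocalWordCount and wordCount are hypothetical names used for illustration.
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))      // split every line into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .groupBy(_._1)              // stand-in for the grouping done by reduceByKey
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
}
```

Calling `LocalWordCount.wordCount(Seq("a b a", "b c"))` counts `a` twice, `b` twice and `c` once. In Spark the grouping step runs as a shuffle across Executors rather than inside a single JVM.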
2013.6 - Apache Incubator
2014.2 - Incubator -> Top-Level Project
2014.6 - v1.0
2014.11 - v1.1
2014.12 - v1.2
2015.3 - v1.3
2015.5 - v1.4
2015.9 - v1.5
2016.1 - v1.6
2016.7 - v2.0
Spark Release History
A Spark conference organized by Databricks; held 7 times since 2013, including in the EU since 2015
4
Spark Summit
Spark Summit 2016 @ SF
2500+, 720+ (*sold-out); Spark Summit East 2016: 1300+
10
IBM, MS, Cloudera, Hortonworks, Hadoop
Spark Summit 2016 @ SF 6/6-6/8
KEYNOTE 11%
ENTERPRISE 6%
DEVELOPER 15%
DATA SCIENCE 17%
RESEARCH 18%
ECOSYSTEM 18%
USE CASE 15%
(101)
v2.0 (Continuous, Structuring API)
Databricks co-founder and CTO Matei Zaharia
Keynote: Spark 2.0
Catalyst
Keynote: Deep Learning, AI
Jeff Dean (Google), Andrew Ng (Baidu), IBM/Yahoo
Spark
Spark
IBM: 7, Microsoft: 5, Uber, Airbnb, PB/ The Weather Company/IBM
Spark Streaming, Apache Kafka*
Apache HBase, Apache Cassandra
Spark
*Apache Kafka,
Spark Streaming
Spark - Microsoft Bing
Top 5 Lessons Learned in Building Streaming Applications at Microsoft Bing Scale
Bing, Kafka, Spark v1.6.x, AP, time, Join, API
Spark Streaming
Temporal Operators For Spark Streaming And Its Application For Office365 Service Monitoring
Temporal Join/Aggregate operators
Spark - Microsoft Office365
The Internet of Everywhere: How IBM The Weather Company Scales
BI/Visualization
~360PB/
Spark - The Weather Company
* Spark Summit East 2016
Online Security Analytics on Large Scale Video Surveillance System
OpenCV, DL4j, DeepDist
Spark - EMC
Airstream: Spark Streaming At Airbnb
///HBase
Spark - Airbnb
SPARK
Parquet/ORC, map/shuffle, SQL/DataFrame IF, ...
DB
Join: SPARK-16026 (Cost-based optimizer framework)
Spark
RDD
DBMS
Catalyst: Spark
Analyzer
Optimizer
SparkPlanner
Source: Deep Dive into Spark SQL's Catalyst Optimizer, https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Filter->Join
SELECT * FROM a, b WHERE a.id == b.id AND a.value = 3;
Operators: Join, Filter
Filter -> Join:
1) Filter a: filter the rows of a on each Executor
2) Hash-Join the filtered a with b on id on each Executor
* ab
2) a.value=3
* ab
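The effect of running the Filter before the Join can be sketched on in-memory tables. The snippet below is an illustration under assumptions, not Catalyst code: the case classes `A` and `B`, the column `name`, and the functions `hashJoin`, `joinThenFilter` and `filterThenJoin` are all hypothetical. Both plans return the same rows, but the second one probes the hash table only with the rows of a that satisfy `a.value = 3`.

```scala
// Hypothetical in-memory tables a(id, value) and b(id, name), for illustration only.
object FilterPushdown {
  case class A(id: Int, value: Int)
  case class B(id: Int, name: String)

  // Hash join on id: build a hash table on b, then probe it with the rows of a.
  def hashJoin(a: Seq[A], b: Seq[B]): Seq[(A, B)] = {
    val table = b.groupBy(_.id)
    a.flatMap(x => table.getOrElse(x.id, Seq.empty).map(y => (x, y)))
  }

  // Plan 1: Join first, then Filter on a.value = 3.
  def joinThenFilter(a: Seq[A], b: Seq[B]): Seq[(A, B)] =
    hashJoin(a, b).filter { case (x, _) => x.value == 3 }

  // Plan 2: Filter a first, then Join the (smaller) result with b.
  def filterThenJoin(a: Seq[A], b: Seq[B]): Seq[(A, B)] =
    hashJoin(a.filter(_.value == 3), b)
}
```

On a cluster the saving is larger than in this local sketch, because rows removed by the Filter never have to be shuffled between Executors for the Join.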
...java
Spark
Analyzer
Optimizer
SparkPlanner
Source: Deep Dive into Spark SQL's Catalyst Optimizer, https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Whole-stage Codegen: switched on/off with spark.sql.codegen.wholeStage (default: on)
Spark
Volcano-Style, CPU
getNext()
Volcano-Style
getNext() returns 1 row per call: for 100 rows, Scan and Filter are each called 100 times
getNext()
getNext()
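The iterator model behind these getNext() calls can be sketched in a few lines of Scala. This is a minimal illustration with hypothetical names (`Volcano`, `Operator`, `Scan`, `Filter`, `count`), not Spark's actual operator classes. The `calls` counter on `Scan` makes the per-row call overhead visible.

```scala
// A minimal Volcano-style (iterator model) sketch; all names are hypothetical.
object Volcano {
  // Every operator hands back one row per getNext() call; None signals end of input.
  trait Operator { def getNext(): Option[Int] }

  final class Scan(rows: Seq[Int]) extends Operator {
    private val it = rows.iterator
    var calls = 0 // number of getNext() invocations on this operator
    def getNext(): Option[Int] = {
      calls += 1
      if (it.hasNext) Some(it.next()) else None
    }
  }

  final class Filter(child: Operator, pred: Int => Boolean) extends Operator {
    // Pull from the child until a row passes the predicate or the input ends.
    def getNext(): Option[Int] = {
      var row = child.getNext()
      while (row.exists(r => !pred(r))) row = child.getNext()
      row
    }
  }

  // Drain the operator tree from the root and count the rows it produces.
  def count(root: Operator): Int = {
    var n = 0
    while (root.getNext().isDefined) n += 1
    n
  }
}
```

Scanning 100 rows through this tree costs 101 calls on `Scan` alone (100 rows plus the end-of-input call), each one a virtual call that returns a boxed value.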
Source: A Look Back at Single-Threaded CPU Performance, http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance
[Figure: SPECint and SPECfp single-threaded CPU performance over time; annotation at 2004/1]
CPU
CPU
for loops, SIMD
CPU
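For contrast with the iterator model, a hand-written version of a scan, a filter and a count collapses into one tight loop over a primitive array. The sketch below uses hypothetical names (`FusedLoop`, `countMatching`) and a `while` loop, which is the usual Scala idiom for this kind of hot loop. Whether the JIT compiler actually applies SIMD to it depends on the JVM and is not guaranteed.

```scala
// Scan + Filter + count written by hand as a single loop; names are hypothetical.
object FusedLoop {
  // No per-row virtual calls and no boxing: the column is a primitive Int array
  // and the predicate is inlined into the loop body.
  def countMatching(column: Array[Int], key: Int): Int = {
    var count = 0
    var i = 0
    while (i < column.length) {
      if (column(i) == key) count += 1
      i += 1
    }
    count
  }
}
```

A call such as `FusedLoop.countMatching(column, 1000)` computes the same answer as a Scan -> Filter -> Aggregate operator tree, which is the shape of code that query compilation aims to generate automatically.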
Thomas Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware, VLDB'11
Operators (Scan, Filter, ...) implement two interfaces: produce and consume
produce: calls the child's produce or, at a leaf, the parent's consume
consume: emits this operator's code and calls the parent's consume
Spark
Spark
[Figure: operator tree annotated with produce() and consume() calls, started from doCodegen()]
- aggregate.produce: aggregate.child.produce()
- aggregate.consume: print "count += 1"
- project.produce: project.child.produce()
- project.consume: project.parent.consume()
- filter.produce: filter.child.produce()
- filter.consume: print "if (ss_item_sk == 1000) { ${filter.parent.consume()} }"
- scan.produce: print "for (ss_item_sk in store_sales) { ${scan.parent.consume()} }"
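The produce/consume rules listed above can be turned into a toy code generator. Everything in this sketch is hypothetical (`ToyCodegen`, the `Operator` trait, the operator classes `Scan`, `Filter`, `Project` and `Count`, and the use of plain strings as generated code); Spark's real implementation is considerably more involved. The point is only to show how produce() calls travel down the plan and consume() calls travel back up.

```scala
// A toy produce/consume code generator; every name here is hypothetical and
// much simpler than what Spark does internally.
object ToyCodegen {
  sealed trait Operator {
    var parent: Option[Operator] = None
    def produce(): String // ask this subtree to emit its code
    def consume(): String // emit the code that handles one row
    protected def parentConsume(): String = parent.map(_.consume()).getOrElse("")
  }

  final class Scan(table: String, column: String) extends Operator {
    // The leaf emits the loop and lets its parent fill in the loop body.
    def produce(): String = s"for ($column in $table) { ${parentConsume()} }"
    def consume(): String = sys.error("a leaf operator has no child to consume from")
  }

  final class Filter(child: Operator, condition: String) extends Operator {
    child.parent = Some(this)
    def produce(): String = child.produce()
    def consume(): String = s"if ($condition) { ${parentConsume()} }"
  }

  final class Project(child: Operator) extends Operator {
    child.parent = Some(this)
    def produce(): String = child.produce()
    def consume(): String = parentConsume() // pass the row straight through
  }

  final class Count(child: Operator) extends Operator {
    child.parent = Some(this)
    def produce(): String = child.produce()
    def consume(): String = "count += 1"
  }

  // Entry point: start the produce() chain at the root of the plan.
  def doCodegen(root: Operator): String = root.produce()
}
```

Building the plan as `new Count(new Project(new Filter(new Scan("store_sales", "ss_item_sk"), "ss_item_sk == 1000")))` and passing it to `ToyCodegen.doCodegen` yields a single fused loop in pseudocode, with the filter condition and the counter update nested inside the scan loop.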
Spark
doCodegen()
Spark
SPARK
scala> import org.apache.spark.sql.execution.debug._
scala> val df = sql("SELECT sqrt(a) FROM test WHERE b = 1")
scala> df.explain
== Physical Plan ==
*Project [SQRT(cast(_1#2 as double)) AS SQRT(CAST(a AS DOUBLE))#18]
+- *Filter (_2#3 = 1)
   +- LocalTableScan [_1#2, _2#3]
scala> df.debugCodegen
Spark
1970s: IBM System R; SEQUEL
1980s: C/Pascal
2000s: DBMS, pointer injection
2000s: Java
2010s: Vector-at-a-time, Micro-Specialization
Amazon Redshift, SQL Server, Presto, Hive, Spark, Impala, MemSQL (SQL)
build/sbt "sql/test-only *BenchmarkWholeStageCodegen"
TPC-DS
Source: Spark Performance: What's Next, Spark Summit East '16, http://www.slideshare.net/databricks/spark-performance-whats-next#28