Copyright © 2016 NTT Corp. All Rights Reserved. I attended Spark Summit 2016 @ SF, so here is a report on the latest case studies, plus a look at what is new about Spark from a database-technology perspective. 2016/7/15 室 健 @ NTT (Nippon Telegraph and Telephone Corporation)

[db tech showcase Tokyo 2016] B31: A report from Spark Summit 2016 @ SF: the latest case studies, and what's new about Spark from a database-technology perspective


  • Slide 1

    Copyright © 2016 NTT Corp. All Rights Reserved.

    Spark Summit 2016 @ SF: the latest case studies, and what's new about Spark from a database-technology perspective

    2016/7/15 室 健 @ NTT

  • Slide 2

    Who am I?

  • Slide 3

    A one-size-fits-many engine: a single application (AP) can combine SQL, Streaming, MLlib, ...

    What is Spark?

  • Slide 4

    Spark's API

    Spark's core data abstraction: the RDD (Resilient Distributed Dataset)

    val data = Array(1, 2, 3, 4, 5)      // a local Scala collection
    val dataRdd = sc.parallelize(data)   // distribute it as an RDD
    val result = dataRdd.map(_ + 1)      // transformations build a new RDD
      .filter(_ % 2 == 0)
      .reduce(_ + _)                     // actions return a value to the driver

  • Slide 5

    The Driver Program schedules Spark Tasks onto Executors running on the Worker Nodes

    How Spark runs

    Source: Cluster Overview, http://spark.apache.org/docs/1.6.2/cluster-overview.html

  • Slide 6

    1. Download the Spark binary v1.6.2: http://spark.apache.org/downloads.html

    2. Launch a Spark shell: ./bin/spark-shell

    3. Do word counting:
       scala> val textFile = sc.textFile("hoge.txt")
       scala> val counts = textFile.flatMap(_.split(" "))
                .map(word => (word, 1))
                .reduceByKey(_ + _)

    Quick Start!

  • Slide 7

    2013.6  - entered the Apache Incubator
    2014.2  - promoted from the Incubator to a Top-Level Project
    2014.6  - v1.0
    2014.11 - v1.1
    2014.12 - v1.2
    2015.3  - v1.3
    2015.5  - v1.4
    2015.9  - v1.5
    2016.1  - v1.6
    2016.7  - v2.0

    Spark Release History

  • Slide 8

    Spark Summit: the Spark conference hosted by Databricks, held since 2013 (this was the 7th edition); an EU edition has also run since 2015

    Spark Summit

    Spark Summit 2016 @ SF

  • Slide 9

    Spark Summit 2016 @ SF, 6/6-6/8

    Scale: 2500+ / 720+ (*sold out; Spark Summit East 2016 @ NY also drew 1300+)

    Sponsors included IBM, MS, Cloudera, Hortonworks, and other Hadoop-related vendors

    Session breakdown (101 talks in total):
    KEYNOTE 11% / ENTERPRISE 6% / DEVELOPER 15% / DATA SCIENCE 17% /
    RESEARCH 18% / ECOSYSTEM 18% / USE CASE 15%

  • Slide 10

    v2.0 themes: Continuous applications (Structured Streaming) and the structuring of the APIs (DataFrame/Dataset)

    Presented by Matei Zaharia, Databricks co-founder and CTO

    Keynote: Spark 2.0

    Much of the performance work centers on Catalyst (covered later in this talk)

  • Slide 11

    The keynotes also covered Deep Learning and AI

    Speakers included Jeff Dean (Google), Andrew Ng (Baidu), and others from IBM and Yahoo

    Spark use cases were everywhere at the conference

    IBM gave 7 talks and Microsoft 5; Uber and Airbnb also presented, as did The Weather Company / IBM, which operates at the PB scale

  • Slide 12

    Most streaming use cases paired Spark Streaming with Apache Kafka* for ingestion

    Apache HBase and Apache Cassandra were common choices for the storage layer

    Spark Streaming use cases

    * Apache Kafka: a distributed publish/subscribe messaging system

  • Slide 13

    Spark use case: Microsoft Bing

    Top 5 Lessons Learned in Building Streaming Applications at Microsoft Bing Scale

    Bing feeds events through Kafka into Spark Streaming applications (APs) on v1.6.x; the lessons centered on event-time handling and the Join APIs

    Spark Streaming

  • Slide 14

    Temporal Operators For Spark Streaming And Its Application For Office365 Service Monitoring

    They extended Spark Streaming with Temporal Join/Aggregate operators

    Spark use case: Microsoft Office365

  • Slide 15

    The Internet of Everywhere: How IBM The Weather Company Scales

    Spark feeds their BI / visualization stack

    Data volume: ~360 PB/day

    Spark use case: The Weather Company (IBM)

    * A similar talk was also given at Spark Summit East 2016

  • Slide 16

    Online Security Analytics on Large Scale Video Surveillance System

    Video analysis on Spark, built with OpenCV, DL4j, and DeepDist

    Spark use case: EMC

  • Slide 17

    Airstream: Spark Streaming At Airbnb

    A streaming platform whose pipelines store their results in HBase

    Spark use case: Airbnb

  • Slide 18

    WHAT'S NEW ABOUT SPARK, FROM A DATABASE-TECHNOLOGY PERSPECTIVE

  • Slide 19

    Spark draws heavily on database technology: columnar formats (Parquet/ORC), map/shuffle execution, SQL/DataFrame interfaces (IFs), ...

    This part looks at Spark through a DB lens

    Work in progress includes Join reordering via SPARK-16026: Cost-based optimizer framework

    Spark, seen from database technology

  • Slide 20

    DataFrame/SQL programs are optimized and compiled down to RDD-level execution

    The same role a query optimizer plays in a DBMS

    Catalyst: Spark's query optimizer

    (Figure: plan flow through the Analyzer, the Optimizer, and the SparkPlanner)

    Source: Deep Dive into Spark SQL's Catalyst Optimizer, https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
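    The three phases can also be observed interactively. A minimal sketch, assuming a Spark 2.x shell with a session bound to `spark` (not from the original slides):

    ```scala
    // Inspecting each Catalyst phase of a query from a Spark 2.x shell.
    val df = spark.range(10).filter("id % 2 = 0")

    df.queryExecution.analyzed       // logical plan, after the Analyzer
    df.queryExecution.optimizedPlan  // logical plan, after the Optimizer
    df.queryExecution.sparkPlan      // physical plan, chosen by the SparkPlanner
    ```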

  • Slide 21

    Optimization example: moving a Filter below a Join (predicate pushdown)

    SELECT * FROM a, b WHERE a.id == b.id AND a.value = 3;
    (naive plan: Join, then Filter)

    Running the Filter before the Join:
    1) apply the Filter to a, and ship the filtered a to the Executors
    2) hash-join the filtered a with b on id, on each Executor

    * a and b are tables

  • Slide 22

    Optimization example: moving a Filter below a Join (predicate pushdown)

    SELECT * FROM a, b WHERE a.id == b.id AND a.value = 3;
    (naive plan: Join, then Filter)

    Running the Filter before the Join:
    1) apply the Filter to a, and ship the filtered a to the Executors
    2) hash-join the filtered a with b on id, on each Executor

    In step 2), only the rows satisfying a.value = 3 take part in the join

    * a and b are tables
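    The effect can be mimicked in plain Scala. This is a toy sketch over in-memory sequences (names like `PushdownSketch`, `naive`, `pushed` are illustrative, not Spark code): filtering `a` first shrinks the join input while leaving the result unchanged.

    ```scala
    // Toy predicate pushdown: filter a before the join; the result is identical,
    // but far fewer rows enter the (expensive) join.
    object PushdownSketch {
      val a = Seq((1, 3), (2, 5), (3, 3))        // rows of a: (id, value)
      val b = Seq((1, "x"), (2, "y"), (3, "z"))  // rows of b: (id, payload)

      // Naive plan: join on id first, apply a.value = 3 afterwards.
      val naive = for {
        (aid, av) <- a
        (bid, bp) <- b if aid == bid
        if av == 3
      } yield (aid, av, bp)

      // Pushed-down plan: filter a first, then join the smaller input.
      val filteredA = a.filter { case (_, av) => av == 3 }
      val pushed = for {
        (aid, av) <- filteredA
        (bid, bp) <- b if aid == bid
      } yield (aid, av, bp)
    }
    ```

    Both plans return the same rows, but in the pushed-down plan only the filtered subset of `a` is shipped to and probed in the join, which is exactly what Catalyst's rewrite buys in a distributed setting.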

  • Slide 23

  • Slide 24

    ...and the chosen physical plan is ultimately turned into generated Java code and executed

    How Spark executes a query

    (Figure: plan flow through the Analyzer, the Optimizer, and the SparkPlanner, ending in code generation)

    Source: Deep Dive into Spark SQL's Catalyst Optimizer, https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

  • Slide 25

    Whole-stage Codegen: can be switched on/off with spark.sql.codegen.wholeStage (on by default)

    Code generation in Spark

  • Slide 26

    The classic Volcano-style model spends CPU cycles poorly

    Each operator pulls one row per getNext() call from its child

    Volcano-style execution

    Because getNext() returns a single row, a Scan -> Filter plan over 100 rows makes on the order of 100 getNext() calls in each operator

    These per-row getNext() calls are virtual calls, which are hard to inline and predict
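    The pull-based interface can be sketched in a few lines of plain Scala (names like `Op`, `Scan`, `Filter` are illustrative, not Spark's classes): every row surfaces only through a chain of virtual next() calls.

    ```scala
    // Volcano-style sketch: one virtual next() call per operator per row.
    object VolcanoSketch {
      trait Op { def next(): Option[Int] }

      // Leaf operator: emits rows from an underlying iterator.
      final class Scan(rows: Iterator[Int]) extends Op {
        def next(): Option[Int] = if (rows.hasNext) Some(rows.next()) else None
      }

      // Filter: keeps pulling from its child until a row passes the predicate.
      final class Filter(child: Op, pred: Int => Boolean) extends Op {
        def next(): Option[Int] = {
          var row = child.next()
          while (row.exists(v => !pred(v))) row = child.next()
          row
        }
      }

      // Drive the Scan -> Filter plan to exhaustion, one next() at a time.
      def run(): List[Int] = {
        val plan = new Filter(new Scan((1 to 100).iterator), _ % 2 == 0)
        val out = scala.collection.mutable.ListBuffer.empty[Int]
        var row = plan.next()
        while (row.isDefined) { out += row.get; row = plan.next() }
        out.toList
      }
    }
    ```

    For 100 input rows, the Scan answers 100 next() calls and the Filter is invoked for every surviving row; whole-stage codegen exists to eliminate exactly this per-row call overhead.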

  • Slide 27

    Single-threaded CPU performance (SPECint / SPECfp) has largely flattened since around 2004/1

    CPU performance trends

    Cited from the article A Look Back at Single-Threaded CPU Performance, http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance

  • Slide 28

    To use the CPU well, the hot path should be simple for-loops that compilers and hardware handle efficiently (e.g. via SIMD)

    Using the CPU efficiently

  • Slide 29

    Thomas Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware, VLDB '11

    Every operator (Scan, Filter, ...) implements a two-method code-generation interface (IF): produce and consume

    produce: emits the code that drives row production, recursively calling the child's produce

    consume: emits the code fragment that processes one row, handing it on via the parent's consume

    Spark's whole-stage codegen is based on this paper

  • Slide 30

    Whole-stage codegen in Spark

    (Figure: produce() calls flow down the operator tree, consume() calls flow back up, and doCodegen() stitches the fragments into one loop)

    Aggregate(count):
    - produce: call child.produce()
    - consume: emit  count += 1

    Project:
    - produce: call child.produce()
    - consume: call parent.consume()

    Filter:
    - produce: call child.produce()
    - consume: emit  if (ss_item_sk == 1000) { ${parent.consume()} }

    Scan:
    - produce: emit  for (ss_item_sk in store_sales) { ${parent.consume()} }
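    The protocol above can be expressed as a toy sketch in plain Scala (string-building functions; this is not Spark's actual CodegenSupport API): each operator transforms "code that handles one row" into code for the pipeline so far, and composing them yields exactly one fused loop.

    ```scala
    // produce/consume sketch: compose operators into one fused code string.
    object ProduceConsumeSketch {
      // A Consume is "code that handles one row", given the row variable name.
      type Consume = String => String

      // Scan drives the loop: its produce() emits the for-loop and asks the
      // operator above it (via `consume`) for the loop body.
      def scanProduce(table: String, rowVar: String)(consume: Consume): String =
        s"for ($rowVar <- $table) { ${consume(rowVar)} }"

      // Filter's consume wraps its parent's fragment in an if.
      def filterConsume(pred: String => String)(parent: Consume): Consume =
        row => s"if (${pred(row)}) { ${parent(row)} }"

      // The count aggregate's consume just bumps a counter.
      val countConsume: Consume = _ => "count += 1"

      // Fuse Scan(store_sales) -> Filter(ss_item_sk == 1000) -> Aggregate(count).
      val fused: String =
        scanProduce("store_sales", "ss_item_sk")(
          filterConsume(r => s"$r == 1000")(countConsume))
    }
    ```

    The fused string is a single loop with no per-row virtual calls left, which is precisely what makes the generated code friendly to the JIT and the CPU.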

  • Slide 31

    Whole-stage codegen in Spark

    The fused fragments for each stage are emitted by doCodegen()

  • Slide 32

    Checking the code Spark generates

    scala> import org.apache.spark.sql.execution.debug._
    scala> val df = sql("SELECT sqrt(a) FROM test WHERE b = 1")
    scala> df.explain

    == Physical Plan ==
    *Project [SQRT(cast(_1#2 as double)) AS SQRT(CAST(a AS DOUBLE))#18]
    +- *Filter (_2#3 = 1)
       +- LocalTableScan [_1#2, _2#3]

    (operators prefixed with * run inside whole-stage generated code)

    scala> df.debugCodegen

  • Slide 33

    Code generation in Spark

  • Slide 34

    1970s: IBM System R compiled SEQUEL queries down to machine code

    1980s: systems shifted to interpreted execution, with engines written in C/Pascal

    2000s: DBMSs revived compilation techniques such as pointer injection

  • Slide 35

    2000s: Java-based engines generate and JIT-compile query code

    2010s: related techniques such as Vector-at-a-time execution and Micro-Specialization

    Query compilation is now used by many SQL engines: Amazon Redshift, SQL Server, Presto, Hive, Spark, Impala, MemSQL, ...

  • Slide 36

    build/sbt "sql/test-only *BenchmarkWholeStageCodegen"

  • Slide 37

    build/sbt "sql/test-only *BenchmarkWholeStageCodegen"

  • Slide 38

    build/sbt "sql/test-only *BenchmarkWholeStageCodegen"

  • Slide 39

    TPC-DS benchmark results

    Cited from the slide Spark Performance: What's Next, SPARK SUMMIT EAST '16, http://www.slideshare.net/databricks/spark-performance-whats-next#28