34
SPARK SUMMIT EUROPE 2016 SPARK AND COUCHBASE AUGMENTING THE OPERATIONAL DATABASE WITH SPARK Michael Nitschinger Couchbase

Spark Summit EU talk by Michael Nitschinger

Embed Size (px)

Citation preview

Page 1: Spark Summit EU talk by Michael Nitschinger

SPARK SUMMIT EUROPE 2016

SPARK AND COUCHBASE AUGMENTING THE OPERATIONAL DATABASE WITH SPARK

Michael NitschingerCouchbase

Page 2: Spark Summit EU talk by Michael Nitschinger
Page 3: Spark Summit EU talk by Michael Nitschinger

WHY SPARK AND COUCHBASEOverview & Use-Cases

Page 4: Spark Summit EU talk by Michael Nitschinger

Use CasesOperations Analytics

CB

§ Recommendations§ Predictive analytics§ Fraud detection

§ Catalog § Personalization§ Mobile applications

Page 5: Spark Summit EU talk by Michael Nitschinger

Use Case: Operationalize Analytics / ML

Hadoop

ML Model

Data Warehouse

Training Data

CB

ModelOnline Data

Serving

Predictions

Page 6: Spark Summit EU talk by Michael Nitschinger

Adapted from: Databricks – Not Your Father’s Database https://www.brighttalk.com/webcast/12891/196891

Page 7: Spark Summit EU talk by Michael Nitschinger

Use Case: Data Integration

RDBMSS3HDFS ES

NoSQL

Page 8: Spark Summit EU talk by Michael Nitschinger

Standalone Deployment

Page 9: Spark Summit EU talk by Michael Nitschinger

Side-By-Side Deployment

Page 10: Spark Summit EU talk by Michael Nitschinger

ACCESS PATTERNSFrom Spark to Couchbase and Back Again

Page 11: Spark Summit EU talk by Michael Nitschinger

Key-Value

Fetch/Store byDocument ID

Page 12: Spark Summit EU talk by Michael Nitschinger

Key-Value

Fetch/Store byDocument ID

N1QL Query

Fetch byCriteria “SQL”

Page 13: Spark Summit EU talk by Michael Nitschinger

Key-Value

Fetch/Store byDocument ID

N1QL Query

Fetch byCriteria “SQL”

Map-Reduce Views

Materialized Indexes (Aggregation)

Page 14: Spark Summit EU talk by Michael Nitschinger

Key-Value

Fetch/Store byDocument ID

N1QL Query

Fetch byCriteria “SQL”

Map-Reduce Views

Materialized Indexes (Aggregation)

Streaming

Mutation StreamsFor Processing

Page 15: Spark Summit EU talk by Michael Nitschinger

Key-Value

Fetch/Store byDocument ID

N1QL Query

Fetch byCriteria “SQL”

Map-Reduce Views

Materialized Indexes (Aggregation)

Streaming

Mutation StreamsFor Processing

Full Text

Search on Freeform Text

Page 16: Spark Summit EU talk by Michael Nitschinger

Key-Value

Fetch/Store byDocument ID

N1QL Query

Fetch byCriteria “SQL”

Map-Reduce Views

Materialized Indexes (Aggregation)

Streaming

Mutation StreamsFor Processing

Page 17: Spark Summit EU talk by Michael Nitschinger

Couchbase Data Partitioning

Page 18: Spark Summit EU talk by Michael Nitschinger

Data Locality

• RDD Location Hints based on the Cluster Map

• Not available for N1QL or Views– Round robin - can’t give location hints– Back end is scatter gather with 1 node responding

Page 19: Spark Summit EU talk by Michael Nitschinger

N1QL Query

• N1QL is a SQL service with JSON extensions

• Uses Couchbase’s Global Secondary Indexes

• Can run on any nodes within the cluster

• Nodes with differing services can be added and removed as needed on the fly

Page 20: Spark Summit EU talk by Michael Nitschinger

Data Service

Projector & Router

Couchbase Query ArchitectureQuery ServiceIndex Service

SupervisorIndex maintenance &

Scan coordinator

Index#2Index#1

Query Processorcbq-engine

Bucket#1 Bucket#2

DCP StreamIndex#4Index#3

...Bucket#2

Bucket#1

ForestDBStorage Engine

Page 21: Spark Summit EU talk by Michael Nitschinger

Spark SQL Sources

TableScanScan all of the data and return it

PrunedScanScan an index that matches only relevant data to the query at hand.

PrunedFilteredScanScan an index that matches only relevant data to the query at hand.

Page 22: Spark Summit EU talk by Michael Nitschinger

Predicate Conversion

Page 23: Spark Summit EU talk by Michael Nitschinger

Schema Inference

Page 24: Spark Summit EU talk by Michael Nitschinger

Schema Inference

N1QLRelation:28 - Inferring schema from bucket travel-sample with query 'SELECT META(`travel-sample`).id as `META_ID`, `travel-sample`.* FROM `travel-sample` WHERE `type` = 'airline' LIMIT 1000'

N1QLRelation:28 - Executing generated query: 'SELECT `name`,`callsign` FROM `travel-sample` WHERE `type` = 'airline''

Page 25: Spark Summit EU talk by Michael Nitschinger

Schema Inference

Page 26: Spark Summit EU talk by Michael Nitschinger

DCP and Spark Streaming

ReplicaIndexing

Page 27: Spark Summit EU talk by Michael Nitschinger

Structured Streaming Source

27Adapted from https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

DCP Stream Unbounded Table

Page 28: Spark Summit EU talk by Michael Nitschinger

28

(Un)Structured Streaming?

Page 29: Spark Summit EU talk by Michael Nitschinger

29

Structured Streaming Source

Page 30: Spark Summit EU talk by Michael Nitschinger

30

Structured Streaming Sink

Page 31: Spark Summit EU talk by Michael Nitschinger

Couchbase Spark Connector 1.2.1

• Spark 1.6.x support, including Datasets• DCP Flow Control• Enhanced Java APIs

31

Page 32: Spark Summit EU talk by Michael Nitschinger

Couchbase Spark Connector 2.0.0

• Spark 2.0.x Support• Enhanced DCP Client• Experimental Structured Streaming

32

Page 33: Spark Summit EU talk by Michael Nitschinger

Resources• Spark Packages

https://spark-packages.org/package/couchbase/couchbase-spark-connector

• Docs http://docs.couchbase.com

• Source https://github.com/couchbase/couchbase-spark-connector

• Bugs https://issues.couchbase.com/browse/SPARKC

33

Page 34: Spark Summit EU talk by Michael Nitschinger

SPARK SUMMIT EUROPE 2016

THANK YOU.Michael [email protected]@couchbase.com