Spark Summit EU talk by Michael Nitschinger

SPARK SUMMIT EUROPE 2016

SPARK AND COUCHBASE AUGMENTING THE OPERATIONAL DATABASE WITH SPARK

Michael NitschingerCouchbase

WHY SPARK AND COUCHBASEOverview & Use-Cases

Use CasesOperations Analytics

§ Recommendations§ Predictive analytics§ Fraud detection

§ Catalog § Personalization§ Mobile applications

Use Case: Operationalize Analytics / ML

Hadoop

ML Model

Data Warehouse

Training Data

ModelOnline Data

Serving

Predictions

Adapted from: Databricks – Not Your Father’s Database https://www.brighttalk.com/webcast/12891/196891

Use Case: Data Integration

RDBMSS3HDFS ES

Standalone Deployment

Side-By-Side Deployment

ACCESS PATTERNSFrom Spark to Couchbase and Back Again

Key-Value

Fetch/Store byDocument ID

Key-Value

N1QL Query

Fetch byCriteria “SQL”

Key-Value

N1QL Query

Map-Reduce Views

Materialized Indexes (Aggregation)

Key-Value

N1QL Query

Map-Reduce Views

Streaming

Mutation StreamsFor Processing

Key-Value

N1QL Query

Map-Reduce Views

Streaming

Full Text

Search on Freeform Text

Key-Value

N1QL Query

Map-Reduce Views

Streaming

Couchbase Data Partitioning

Data Locality

• RDD Location Hints based on the Cluster Map

• Not available for N1QL or Views– Round robin - can’t give location hints– Back end is scatter gather with 1 node responding

N1QL Query

• N1QL is a SQL service with JSON extensions

• Uses Couchbase’s Global Secondary Indexes

• Can run on any nodes within the cluster

• Nodes with differing services can be added and removed as needed on the fly

Data Service

Projector & Router

Couchbase Query ArchitectureQuery ServiceIndex Service

SupervisorIndex maintenance &

Scan coordinator

Index#2Index#1

Query Processorcbq-engine

Bucket#1 Bucket#2

DCP StreamIndex#4Index#3

...Bucket#2

Bucket#1

ForestDBStorage Engine

Spark SQL Sources

TableScanScan all of the data and return it

PrunedScanScan an index that matches only relevant data to the query at hand.

PrunedFilteredScanScan an index that matches only relevant data to the query at hand.

Predicate Conversion

Schema Inference

N1QLRelation:28 - Inferring schema from bucket travel-sample with query 'SELECT META(`travel-sample`).id as `META_ID`, `travel-sample`.* FROM `travel-sample` WHERE `type` = 'airline' LIMIT 1000'

N1QLRelation:28 - Executing generated query: 'SELECT `name`,`callsign` FROM `travel-sample` WHERE `type` = 'airline''

Schema Inference

DCP and Spark Streaming

ReplicaIndexing

Structured Streaming Source

27Adapted from https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

DCP Stream Unbounded Table

(Un)Structured Streaming?

Structured Streaming Source

Structured Streaming Sink

Couchbase Spark Connector 1.2.1

• Spark 1.6.x support, including Datasets• DCP Flow Control• Enhanced Java APIs

Couchbase Spark Connector 2.0.0

• Spark 2.0.x Support• Enhanced DCP Client• Experimental Structured Streaming

Resources• Spark Packages

https://spark-packages.org/package/couchbase/couchbase-spark-connector

• Docs http://docs.couchbase.com

• Source https://github.com/couchbase/couchbase-spark-connector

• Bugs https://issues.couchbase.com/browse/SPARKC

SPARK SUMMIT EUROPE 2016

THANK YOU.Michael Nitschinger@daschlmichael.nitschinger@couchbase.com

Spark Summit EU talk by Michael Nitschinger

Data & Analytics

Spark Summit EU talk by Johnathan Mercer

Spark Summit EU talk by Tim Hunter

Spark Summit EU talk by Zoltan Zvara

Spark Summit EU talk by Jiri Simsa

Spark + Flashblade: Spark Summit East talk by Brian Gold

Spark Summit EU talk by Nick Pentreath

Spark Summit EU talk by Ted Malaska

Spark Autotuning: Spark Summit East talk by Lawrence Spracklen

Spark Summit EU talk by Simon Whitear

Spark Summit EU talk by Elena Lazovik

Spark Summit EU talk by Casey Stella

Spark Summit EU talk by Qifan Pu

Spark Summit EU talk by Ahsan Javed Awan

Spark Summit EU talk by William Benton

Spark Summit EU talk by Tug Grall

Spark Summit EU talk by Berni Schiefer

Spark Summit EU talk by Oscar Castaneda

Spark Summit EU talk by Emlyn Whittick

Spark Summit EU talk by Bas Geerdink

Spark Summit EU talk by Luca Canali