Big Data Analytics with Couchbase: Hadoop, Kafka, Spark and More
Matt Ingenthron, Sr. Director
Michael Nitschinger, Software Engineer
Agenda

• Define Problem Domain
• How Couchbase fits in
• Demo
• Q&A
Lambda Architecture

[Diagram: the five numbered stages of the Lambda Architecture: (1) DATA feeds both (2) a BATCH layer and (3) a SPEED layer, whose views are merged by (4) a SERVE layer that answers (5) QUERY requests.]
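To make the query-time merge concrete, here is a minimal Scala sketch, with all names hypothetical: the serving layer combines a complete but stale batch view with a real-time view that covers only the most recent data.

    // Hypothetical sketch of the serve layer's merge step.
    case class View(counts: Map[String, Long])

    def query(batch: View, realtime: View): Map[String, Long] =
      (batch.counts.keySet ++ realtime.counts.keySet).map { k =>
        // Batch covers everything up to the last batch run;
        // realtime covers only what arrived since then.
        k -> (batch.counts.getOrElse(k, 0L) + realtime.counts.getOrElse(k, 0L))
      }.toMap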
Lambda Architecture: Interactive and Real Time Applications

[Diagram: the same five stages with concrete components: HADOOP as the BATCH layer, STORM as the SPEED layer, and COUCHBASE serving the SERVE and QUERY stages.]
[Diagram: Kafka producers publish into a broker cluster; a Storm spout per topic consumes ordered subscriptions, with Couchbase as the destination store.]
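As a sketch of the producer side, using the standard Kafka producer API from Scala (the topic name and payload are made up for illustration):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    // Each event is appended to a topic partition; downstream spouts
    // consume the partitions as ordered subscriptions.
    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("clickstream", "user-123", """{"page":"/home"}"""))
    producer.close()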
COMPLEX EVENT PROCESSING

[Diagram: on the TRACKING and COLLECTION side, REST, filter, and metrics stages feed a BATCH TRACK and a REAL-TIME TRACK; a real-time repository, perpetual store, and analytical DB back business intelligence, monitoring, a chat/voice system, and a dashboard on the ANALYSIS AND VISUALIZATION side.]
Integration at Scale
Requirements for data streaming in modern systems:

• Must support high throughput and low latency
• Need to handle failures and pick up where you left off
• Be efficient about resource usage
Data Sync is the Heart of Any Big Data System

Data sync is a fundamental piece of the architecture:
• It maintains data redundancy for High Availability (HA) and Disaster Recovery (DR), protecting against node, rack, and region failures
• It maintains indexes: indexing is key to building faster access paths to query data (spatial, full-text)
DCP and Couchbase Server Architecture
What is DCP?

DCP is the protocol that drives data sync for Couchbase Server. It:
• Increases data sync efficiency with massive data footprints
• Removes slower disk I/O from the data sync path
• Improves latencies: replication for data durability
• Will, in the future, provide a programmable data sync protocol for external stores outside Couchbase Server
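A conceptual sketch of the "pick up where you left off" idea; the types and fields here are illustrative only, not the actual wire protocol:

    // A consumer remembers, per vBucket, the last sequence number it processed.
    case class StreamState(vbucket: Short, vbucketUuid: Long, lastSeqno: Long)

    sealed trait DcpEvent { def seqno: Long }
    case class Mutation(key: String, value: Array[Byte], seqno: Long) extends DcpEvent
    case class Deletion(key: String, seqno: Long) extends DcpEvent

    // On reconnect it requests the stream starting after lastSeqno, so only
    // the missed mutations are re-sent instead of the whole data set.
    def resumeRange(s: StreamState): (Long, Long) = (s.lastSeqno, Long.MaxValue)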
DCP powers many critical components.
Demo
Shopper Tracking (click stream)

Lightweight analytics:
• Department shopped
• Tech platform
• Click tracks by income

Heavier analytics: develop profiles
And at scale…
Couchbase & Apache Spark
Introduction & Integration
What is Spark?
Apache Spark is a fast and general engine for large-scale data processing.
Spark Components
Spark Core: RDDs, Clustering, Execution, Fault Management
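A minimal RDD example in Scala; the input path is made up, and word count is the canonical demo:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Transformations are lazy; work only runs when an action is called.
    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)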
Spark Components
Spark SQL: Work with structured data, distributed SQL querying
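For example, with the Spark 1.3 API, reusing the sc from the previous sketch (jsonFile infers the schema; the path and fields are made up):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Load JSON as a DataFrame and query it with SQL.
    val people = sqlContext.jsonFile("hdfs:///data/people.json")
    people.registerTempTable("people")
    sqlContext.sql("SELECT name, age FROM people WHERE age >= 21").show()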
Spark Components
Spark Streaming: Build fault-tolerant streaming applications
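A minimal DStream sketch on top of the same sc; a socket source keeps it self-contained, and the host and port are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Micro-batches of one second.
    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Running word count over each batch.
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()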
Spark Components
MLlib: Machine Learning built in
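For instance, k-means clustering with the MLlib 1.3 API, on toy data for illustration:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    // Two clusters, at most ten iterations.
    val model = KMeans.train(points, 2, 10)
    model.clusterCenters.foreach(println)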
Spark Components
GraphX: Graph processing and graph-parallel computations
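A small GraphX example over a toy social graph (names made up):

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))

    // in-degree = number of followers per vertex.
    val graph = Graph(vertices, edges)
    graph.inDegrees.collect().foreach(println)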
Spark Benefits
• Linear scalability
• Ease of use
• Fault tolerance
• For developers and data scientists
• Tight but not mandatory Hadoop integration
Spark Facts
• Current release: 1.3.0
• Over 450 contributors; the most active Apache Big Data project
• Huge public interest:
Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q
Daytona GraySort Performance

                          Hadoop MR Record       Spark Record
Data Size                 102.5 TB               100 TB
Elapsed Time              72 mins                23 mins
# Nodes                   2100                   206
# Cores                   50400 physical         6592 virtual
Cluster Disk Throughput   3150 GB/s              618 GB/s
Network                   Dedicated DC, 10Gbps   EC2, 10Gbps
Sort Rate                 1.42 TB/min            4.27 TB/min
Sort Rate/Node            0.67 GB/min            20.7 GB/min

Source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Benchmark: http://sortbenchmark.org/
How does it work?

RDD Creation → Scheduling (DAG) → Task Execution

Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
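The three phases map directly onto code: transformations only record lineage, and the action at the end triggers DAG scheduling and task execution. A minimal sketch:

    val rdd = sc.parallelize(1 to 1000000)          // RDD creation
    val evens = rdd.filter(_ % 2 == 0).map(_ * 2)   // lineage recorded, nothing runs yet
    val n = evens.count()                           // DAG scheduled, tasks executed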
Spark vs Hadoop
• Spark is RAM-bound while Hadoop is disk (HDFS) bound
• Spark's API is easier to reason about and to develop against
• Spark is fully compatible with Hadoop input/output formats
• Hadoop is more mature; the Spark ecosystem is growing fast
Ecosystem Flexibility
[Diagram: Couchbase exposes DCP, KV, N1QL, and Views interfaces, connecting RDBMSs, streams, web APIs, batching/archived data, and OLTP workloads.]
Infrastructure Consolidation
[Diagram: streams, web APIs, and user interaction consolidated onto a single infrastructure.]
Couchbase Connector
• Spark Core: automatic cluster and resource management; creating and persisting RDDs; Java APIs in addition to Scala (planned before GA)
• Spark SQL: easy JSON handling and querying; tight N1QL integration (dp2)
• Spark Streaming: persisting DStreams; DCP source (planned before GA)
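A sketch of what using the connector looks like, based on the 1.0.0-dp docs; configuration keys and method names may change before GA, and the bucket and document IDs are made up:

    import com.couchbase.client.java.document.JsonDocument
    import com.couchbase.client.java.document.json.JsonObject
    import com.couchbase.spark._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("CouchbaseSample")
      .set("com.couchbase.nodes", "127.0.0.1")
      .set("com.couchbase.bucket.default", "") // bucket name -> password
    val sc = new SparkContext(conf)

    // Fetch documents by ID as an RDD...
    sc.couchbaseGet[JsonDocument](Seq("user::1", "user::2")).collect().foreach(println)

    // ...and persist an RDD of documents back into the bucket.
    sc.parallelize(Seq(JsonDocument.create("user::3", JsonObject.create().put("name", "carol"))))
      .saveToCouchbase()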
Connector Facts
• Current version: 1.0.0-dp
• DP2 upcoming
• GA planned for Q3
Code: https://github.com/couchbaselabs/couchbase-spark-connector
Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki
Questions
Matt Ingenthron @ingenthr
Michael Nitschinger @daschl
Thanks
Additional Slides
Use Case at LinkedIn
Michael Kehoe

• Site Reliability Engineer (SRE) at LinkedIn
• SRE for Profile & Higher-Education
• Member of LinkedIn's CBVT
• B.E. (Electrical Engineering) from the University of Queensland, Australia
Kafka @ LinkedIn

• Kafka was created by LinkedIn
• Kafka is a publish-subscribe system built as a distributed commit log
• Processes 500+ TB/day (~500 billion messages) at LinkedIn
LinkedIn's uses of Kafka

• Monitoring: InGraphs
• Traditional messaging (pub-sub)
• Analytics: Who Viewed My Profile, experiment reports, executive reports
• Building block for distributed (log) applications: Pinot, Espresso
Use Case: Kafka to Hadoop (Analytics)
• LinkedIn tracks data to better understand how members use our products
• Information such as which page got viewed and which content got clicked on is sent into a Kafka cluster in each data center
• Some of these events are centrally collected and pushed onto our Hadoop grid for analysis and daily report generation
Couchbase @ LinkedIn
• About 25 separate services with one or more clusters in multiple data centers
• Up to 100 servers in a cluster
• Single and Multi-tenant clusters
Use Case: Jobs Cluster
• Read scaling: Couchbase at ~80k QPS, 24-server cluster(s)
• Hadoop pre-builds data by partition
• Couchbase 99th-percentile latencies
Hadoop to Couchbase

• Our primary use case for Hadoop with Couchbase is building (warming) / recovering Couchbase buckets
• LinkedIn built its own in-house solution to work with our ETL processes, cache invalidation procedures, etc.