Phoenix Hadoop Users Group Couchbase Data Pipeline

Page 1: Couchbase Data Pipeline

Phoenix Hadoop Users Group
Couchbase Data Pipeline

Page 2: Couchbase Data Pipeline

Agenda

• Couchbase Technology
  • What is Couchbase
  • Couchbase and Hadoop Ecosystem
  • Architecture (Node/SDK/Cluster)
• Couchbase at PayPal
  • Couchbase Deployment
  • Use Case Overview
• Kafka Connector Demo

Page 3: Couchbase Data Pipeline

What is Couchbase?

Page 4: Couchbase Data Pipeline

Data management for a broad range of use cases

• Couchbase Server: high availability cache, key-value store, document database
• Couchbase Lite: embedded database
• Couchbase Sync Gateway: sync management

Page 5: Couchbase Data Pipeline

Couchbase Tenets

Flexible data model

Consistent performance at scale

High availability

Easy, affordable scalability

24x365

Page 6: Couchbase Data Pipeline

Couchbase and Hadoop Ecosystem

Page 7: Couchbase Data Pipeline

Couchbase View

NoSQL or Hadoop? NoSQL and Hadoop.

[Diagram: NoSQL and Hadoop shown twice, once labeled "Overlap" and once labeled "Complement"]

Page 8: Couchbase Data Pipeline

Couchbase View

                      Couchbase                   Spark                         Hadoop (Hive)
Use cases             Operational, Web / Mobile   Analytics, Machine Learning   Analytics, Machine Learning
Processing mode       Online, Ad Hoc (New!)       Streaming, Ad Hoc, Batch      Batch, Ad Hoc
Low latency =         < 1ms ops                   Seconds                       Minutes
Users are typically   Millions of customers       100's of analysts             100's of analysts
Big data =            10s of Terabytes            Petabytes(?)                  Petabytes

[Axis labels: OPERATIONAL, ANALYTICAL]

Page 9: Couchbase Data Pipeline

Lambda Architecture

[Diagram: five numbered stages]
1 DATA: New Data Stream
2 BATCH: Batch Layer (All Data; Precompute Views (Map Reduce); Batch Recompute)
3 SPEED: Speed Layer (Process Stream; Incremental Views; Real-Time Increment)
4 SERVE: Serving Layer
5 QUERY: Analysis

Page 10: Couchbase Data Pipeline

Couchbase and Hadoop

[Diagram: Lambda Architecture with the Couchbase Hadoop Connector (Sqoop)]
• Input: New Data Stream
• Batch Layer: All Data; Precompute Views (Map Reduce); Batch Recompute
• Speed Layer: Process Stream; Incremental Views; Real-Time Data; Real-Time Increment
• Serving Layer: Batch Views and Real-Time Views (Partial Aggregates); Merge
• Output: Merged View
• Couchbase Hadoop Connector (Sqoop)

Page 11: Couchbase Data Pipeline

Couchbase and Hadoop

[Diagram: Lambda Architecture with the Couchbase Hadoop Connector (Sqoop), annotated with roles]
• Input: New Data Stream
• Batch Layer: All Data; Precompute Views (Map Reduce); Batch Recompute
• Speed Layer: Process Stream; Incremental Views; Real-Time Data; Real-Time Increment
• Serving Layer: Batch Views and Real-Time Views (Partial Aggregates); Merge
• Output: Merged View
• Couchbase Hadoop Connector (Sqoop)
• Annotations: Stream / Data Ingestion; Store; Incremental Data / Stream processing; Serving merged results / responses
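The Merge step in the diagram can be illustrated with a small sketch. It assumes, purely for illustration, that the batch view and the real-time view are both per-key counts (partial aggregates) and that merging means summing them. The function name `merge_views` and the sample keys are hypothetical and do not come from the slides.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Merge a batch view with a real-time view by summing per-key counts.

    Both arguments are treated as partial aggregates: the batch view covers
    all data up to the last batch recompute, the real-time view covers only
    events that arrived after it.
    """
    merged = Counter(batch_view)
    merged.update(realtime_view)  # Counter.update adds counts, it does not overwrite
    return dict(merged)


# Hypothetical page-view counts, used only to exercise the function.
batch_view = {"home": 1000, "checkout": 120}
realtime_view = {"home": 7, "search": 3}
merged_view = merge_views(batch_view, realtime_view)
print(merged_view)
```

Summation only works for aggregates that can be combined this way, such as counts or sums. Other view types would need their own merge rule.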

Page 12: Couchbase Data Pipeline

Couchbase Connectors

Page 13: Couchbase Data Pipeline

Couchbase Connectors

[Diagram: App, ETL, BI and Visualization tools each connect through xDBC to a cluster of CB Nodes]

Integrations, partnerships

Page 14: Couchbase Data Pipeline

[Diagram labels: COMPLEX EVENT PROCESSING; Real Time REPOSITORY; PERPETUAL STORE; ANALYTICAL DB; BUSINESS INTELLIGENCE; MONITORING; CHAT/VOICE SYSTEM; BATCH TRACK; REAL-TIME TRACK; DASHBOARD]

Page 15: Couchbase Data Pipeline

Architecture
Couchbase Node

Page 16: Couchbase Data Pipeline

Couchbase Server Node

• Single-node type means easier administration and scaling
• Single installation
• Two major components/processes: data manager and cluster manager
• Data manager: C/C++; layer consolidation of caching and persistence
• Cluster manager: Erlang/OTP; administration UIs; out-of-band for data requests

Page 17: Couchbase Data Pipeline

Couchbase Read Operation

[Diagram: the APPLICATION SERVER issues GET DOC 1 and receives DOC 1 from the MANAGED CACHE. Other node components shown: REPLICATION QUEUE, DISK QUEUE, DISK.]

• Single-node type means easier administration and scaling
• Reads out of cache are extremely fast
• No other process/system to communicate with
• Data connection is a TCP-binary protocol

Page 18: Couchbase Data Pipeline

Couchbase Write Operation

[Diagram: the APPLICATION SERVER writes DOC 1 to the MANAGED CACHE. Other node components shown: REPLICATION QUEUE, DISK QUEUE, DISK.]

• Single-node type means easier administration and scaling
• Writes are async by default
• Application gets acknowledgement when successfully in RAM and can trade off waiting for replication or persistence per write
• Replication to 1, 2 or 3 other nodes
• Replication is RAM-based so extremely fast
• Off-node replication is primary level of HA
• Disk written to as fast as possible – no waiting
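A toy model can make the write path on this slide concrete. It is a minimal sketch, not Couchbase code: the class `NodeSketch`, its attributes, and the synchronous `drain_queues` method are all hypothetical, and a single dictionary stands in for another node's RAM.

```python
from collections import deque


class NodeSketch:
    """Toy model of the write path: RAM first, then two background queues."""

    def __init__(self):
        self.managed_cache = {}            # documents held in RAM
        self.disk = {}                     # documents persisted to disk
        self.disk_queue = deque()          # keys waiting to be persisted
        self.replication_queue = deque()   # keys waiting to be replicated
        self.replica = {}                  # stands in for another node's RAM

    def write(self, key, doc):
        """Store in RAM, enqueue for disk and replication, acknowledge at once."""
        self.managed_cache[key] = doc
        self.disk_queue.append(key)
        self.replication_queue.append(key)
        return "ack"                       # returned before disk or replica see the doc

    def drain_queues(self):
        """Background work: flush both queues (done here in one synchronous call)."""
        while self.replication_queue:
            key = self.replication_queue.popleft()
            self.replica[key] = self.managed_cache[key]
        while self.disk_queue:
            key = self.disk_queue.popleft()
            self.disk[key] = self.managed_cache[key]


node = NodeSketch()
ack = node.write("doc1", {"name": "DOC 1"})
persisted_before_drain = "doc1" in node.disk
node.drain_queues()
persisted_after_drain = "doc1" in node.disk
```

The acknowledgement arrives while the document exists only in RAM. An application that needs stronger guarantees for a particular write would wait for the replication or persistence step before proceeding.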

Page 19: Couchbase Data Pipeline

Couchbase Cache Ejection

[Diagram: APPLICATION SERVER, MANAGED CACHE, REPLICATION QUEUE, DISK QUEUE and DISK, with DOC 1 through DOC 5 shown in the managed cache and on disk.]

• Single-node type means easier administration and scaling
• Layer consolidation means read through and write through cache
• Couchbase automatically removes data that has already been persisted from RAM
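The ejection rule stated above, that only data already persisted is removed from RAM, can be sketched in a few lines. The function `eject_persisted`, the `max_items` threshold and the insertion-order eviction policy are assumptions made for illustration; the slides do not describe how the server chooses which persisted documents to eject.

```python
def eject_persisted(managed_cache, persisted_keys, max_items):
    """Drop persisted documents from RAM until at most max_items remain.

    Documents not yet on disk are never ejected, because RAM holds their
    only copy. Eviction order here is simply insertion order (an assumption).
    """
    for key in list(managed_cache):
        if len(managed_cache) <= max_items:
            break
        if key in persisted_keys:
            del managed_cache[key]
    return managed_cache


cache = {f"doc{i}": f"DOC {i}" for i in range(1, 6)}   # DOC 1 .. DOC 5 in RAM
on_disk = {"doc1", "doc2", "doc3", "doc4"}             # doc5 not yet persisted
eject_persisted(cache, on_disk, max_items=3)
```

In this run the two oldest persisted documents leave RAM, while the unpersisted document stays cached regardless of memory pressure.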

Page 20: Couchbase Data Pipeline

Couchbase Cache Miss

[Diagram: the APPLICATION SERVER issues GET DOC 1. The MANAGED CACHE holds DOC 2 through DOC 5, so DOC 1 is fetched from DISK and returned to the application.]

• Single-node type means easier administration and scaling
• Layer consolidation means 1 single interface for App to talk to and get its data back as fast as possible
• Separation of cache and disk allows for fastest access out of RAM while pulling data from disk in parallel
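The cache-miss behaviour described on this slide is the read-through pattern: the application talks to one interface, and a miss is resolved from disk behind it. The sketch below assumes a hypothetical `ReadThroughCache` class with plain dictionaries standing in for RAM and disk. It does not model the parallel disk fetch mentioned above.

```python
class ReadThroughCache:
    """Single interface for reads: serve from RAM, fall back to disk on a miss."""

    def __init__(self, disk):
        self.disk = disk          # stands in for the on-disk store
        self.ram = {}             # stands in for the managed cache
        self.misses = 0

    def get(self, key):
        if key in self.ram:       # cache hit: no disk access needed
            return self.ram[key]
        self.misses += 1          # cache miss: fetch from disk ...
        doc = self.disk[key]      # raises KeyError if the document does not exist
        self.ram[key] = doc       # ... and keep it in RAM for later reads
        return doc


store = ReadThroughCache(disk={"doc1": "DOC 1"})
first = store.get("doc1")    # miss, loaded from disk
second = store.get("doc1")   # hit, served from RAM
```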

Page 21: Couchbase Data Pipeline

Architecture
Couchbase SDK

Page 22: Couchbase Data Pipeline

Couchbase SDK

• Documents are integral to the SDKs.
• All SDKs support JSON format
• In addition: Serialized objects, Unquoted Strings, Binary pass-through
• A Document contains:

Property   Description
ID         The bucket-unique identifier
Content    The value that is stored
Expiry     An expiration time
CAS        Check-and-Set identifier
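The four document properties and the role of the CAS value can be shown with an in-memory sketch. Everything here is hypothetical: `Document`, `BucketSketch`, `CasMismatch` and the integer CAS counter are illustrative stand-ins and not the API of any Couchbase SDK. The idea demonstrated is check-and-set: a replace succeeds only if the caller presents the CAS value of the version currently stored.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Document:
    id: str          # bucket-unique identifier
    content: Any     # the stored value
    expiry: int      # expiration time (0 is used here to mean "never")
    cas: int         # check-and-set identifier


class CasMismatch(Exception):
    """Raised when the caller's CAS value is stale."""


class BucketSketch:
    def __init__(self):
        self._docs = {}
        self._next_cas = 1

    def upsert(self, doc_id, content, expiry=0):
        doc = Document(doc_id, content, expiry, self._next_cas)
        self._next_cas += 1
        self._docs[doc_id] = doc
        return doc

    def get(self, doc_id):
        return self._docs[doc_id]

    def replace(self, doc_id, content, cas):
        """Replace only if nobody changed the document since `cas` was read."""
        current = self._docs[doc_id]
        if current.cas != cas:
            raise CasMismatch(doc_id)
        return self.upsert(doc_id, content, current.expiry)


bucket = BucketSketch()
v1 = bucket.upsert("user::1", {"visits": 1})
v2 = bucket.replace("user::1", {"visits": 2}, cas=v1.cas)   # succeeds
try:
    bucket.replace("user::1", {"visits": 99}, cas=v1.cas)   # stale CAS
    stale_write_rejected = False
except CasMismatch:
    stale_write_rejected = True
```

A caller that hits the mismatch would typically re-read the document, reapply its change, and retry with the fresh CAS value.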

Page 23: Couchbase Data Pipeline

Couchbase SDK

What does it mean to be a Couchbase SDK?

Cluster
• Function: Hold on to cluster information such as topology.
• API: Reference Cluster Management, openBucket(), info(), disconnect()

Bucket
• Function: Manage connections to the bucket within the cluster for different services. Provide a core layer where IO can be managed and optimized. Provide a way to manage buckets.
• API: insertDesignDocument(), flush(), listDesignDocuments()

CRUD
• Function: Give the application developer a concurrent API for basic (k-v) or document management.
• API: get(), insert(), upsert(), remove()

View Query
• Function: Allow for view querying, building of queries and reasonable error handling from the cluster.
• API: abucket.NewViewQuery().Limit().Stale()

N1QL Query
• Function: Allow for querying, execution of other directives such as defining indexes and checking on index state.
• API: abucket.NewN1QLQuery("SELECT * FROM default LIMIT 5").Consistency(gocouchbase.RequestPlus);

Page 24: Couchbase Data Pipeline

Couchbase SDK

Official SDKs
• Java, .NET, Node.js, Python
• PHP, C / C++, Go, Ruby

For each of these we have
• Full Document support
• Interoperability
• Common yet idiomatic Programming Model

Others: Erlang, Perl, TCL, Clojure, Scala

JDBC and ODBC

Page 25: Couchbase Data Pipeline

Architecture
Couchbase Cluster: Node and SDK Interaction

Page 26: Couchbase Data Pipeline

Basic Operation

[Diagram: three servers, each holding ACTIVE and REPLICA shards]
• Couchbase Server 1: ACTIVE shards 5, 2, 9; REPLICA shards 4, 1, 8
• Couchbase Server 2: ACTIVE shards 4, 7, 8; REPLICA shards 6, 3, 2
• Couchbase Server 3: ACTIVE shards 1, 3, 6; REPLICA shards 7, 9, 5

• Application has single logical connection to cluster (client object)
• Data is automatically sharded resulting in even document data distribution across cluster
• Each vbucket replicated 1, 2 or 3 times ("peer-to-peer" replication)
• Docs are automatically hashed by the client to a shard
• Cluster map provides location of which server a shard is on
• Every read/write/update/delete goes to same node for a given key
• Strongly consistent data access ("read your own writes")
• A single Couchbase node can achieve 100k's ops/sec so no need to scale reads

Page 27: Couchbase Data Pipeline

Auto sharding – Bucket and vBuckets

[Diagram: Data buckets, each divided into Virtual buckets (vB) numbered 1 … 1024]

• A bucket is a logical, unique key space
• Multiple buckets can exist within a single cluster of nodes
• Each bucket has active and replica data sets (1, 2 or 3 extra copies)
• Each data set has 1024 Virtual Buckets (vBuckets)
• Each vBucket contains 1/1024th portion of the data set
• vBuckets do not have a fixed physical server location
• Mapping between the vBuckets and physical servers is called the cluster map
• Document IDs (keys) always get hashed to the same vbucket
• Couchbase SDKs look up the vbucket -> server mapping
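The two-step lookup described above, key to vBucket and then vBucket to server, can be sketched as follows. This is an illustration under stated assumptions: CRC32 modulo 1024 is used as the hash, and the exact hash and bit selection used by the real SDKs may differ. The round-robin `build_cluster_map` is a hypothetical stand-in for the map the cluster actually publishes, and vBuckets are numbered from 0 here rather than from 1 as on the slide.

```python
import zlib

NUM_VBUCKETS = 1024


def vbucket_for_key(key: str) -> int:
    """Map a document ID to a vBucket number in [0, NUM_VBUCKETS)."""
    # CRC32 is used as the hash here; the exact bit selection is an assumption.
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def build_cluster_map(servers):
    """Assign vBuckets to servers round-robin (a stand-in for the real map)."""
    return [servers[vb % len(servers)] for vb in range(NUM_VBUCKETS)]


def server_for_key(key, cluster_map):
    """Look up which server currently owns the vBucket for this key."""
    return cluster_map[vbucket_for_key(key)]


cluster_map = build_cluster_map(["server1", "server2", "server3"])
vb = vbucket_for_key("user::1")
owner = server_for_key("user::1", cluster_map)
```

Because the hash depends only on the key, the vBucket for a document never changes. Only the cluster map entry for that vBucket changes when data moves between servers.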

Page 28: Couchbase Data Pipeline

Cluster Map

Page 29: Couchbase Data Pipeline

Cluster Map

Page 30: Couchbase Data Pipeline

Cluster Map – 2 nodes added

Page 31: Couchbase Data Pipeline

Rebalance

[Diagram: Couchbase Servers 4 and 5 join Couchbase Servers 1, 2 and 3. ACTIVE and REPLICA shards are redistributed across all five servers while the application continues READ/WRITE/UPDATE operations.]

• Application has single logical connection to cluster (client object)
• Multiple nodes added or removed at once
• One-click operation
• Incremental movement of active and replica vbuckets and data
• Client library updated via cluster map
• Fully online operation, no downtime or loss of performance
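The idea of moving only as many vBuckets as needed when nodes are added can be sketched with a simple planner. This is not the algorithm Couchbase uses, which the slides do not describe. The function `rebalance`, the per-server quota and the least-loaded target rule are all assumptions, and replicas are ignored for brevity.

```python
def rebalance(old_map, servers):
    """Return (new_map, moves) after spreading vBuckets evenly over `servers`.

    old_map is a list where index = vBucket number and value = owning server.
    A vBucket stays where it is unless its server is over its fair share or
    has left the cluster, so only the necessary vBuckets move.
    """
    quota = -(-len(old_map) // len(servers))        # ceiling division
    load = {s: 0 for s in servers}
    new_map = [None] * len(old_map)
    # First pass: keep vBuckets in place while the owner is under quota.
    for vb, owner in enumerate(old_map):
        if owner in load and load[owner] < quota:
            new_map[vb] = owner
            load[owner] += 1
    # Second pass: hand the remaining vBuckets to the least-loaded servers.
    moves = []
    for vb, owner in enumerate(old_map):
        if new_map[vb] is None:
            target = min(servers, key=lambda s: load[s])
            new_map[vb] = target
            load[target] += 1
            moves.append((vb, owner, target))
    return new_map, moves


old_map = ["s1", "s2", "s3"] * 3                    # 9 shards on 3 servers
new_map, moves = rebalance(old_map, ["s1", "s2", "s3", "s4", "s5"])
```

Each entry in `moves` corresponds to one incremental transfer. After each transfer the cluster map entry for that vBucket would be updated so clients route requests to the new owner.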

Page 32: Couchbase Data Pipeline

Fail Over Node

[Diagram: five Couchbase Servers with ACTIVE and REPLICA shards. One server has failed, and REPLICA shards on the remaining servers take over as ACTIVE for the shards it held.]

• Application has single logical connection to cluster (client object)
• When node goes down, some requests will fail
• Failover is either automatic or manual
• Client library is automatically updated via cluster map
• Replicas not recreated to preserve stability
• Best practice to replace node and rebalance
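Failover can be pictured as an edit to the cluster map: every vBucket whose active copy was on the failed node gets its replica promoted, and no new replicas are created. The sketch below assumes one replica per vBucket and uses hypothetical names (`fail_over`, the two list-based maps); it is not taken from Couchbase itself.

```python
def fail_over(active_map, replica_map, failed_server):
    """Promote replicas for every vBucket whose active copy was on failed_server.

    Both maps are lists indexed by vBucket number. Promoted vBuckets are left
    without a replica (None), mirroring "replicas not recreated".
    """
    new_active = list(active_map)
    new_replica = list(replica_map)
    for vb, owner in enumerate(active_map):
        if owner == failed_server:
            new_active[vb] = replica_map[vb]   # replica becomes the active copy
            new_replica[vb] = None             # no replica until a rebalance
        elif replica_map[vb] == failed_server:
            new_replica[vb] = None             # replica was lost with the node
    return new_active, new_replica


active = ["s1", "s2", "s3", "s1"]
replica = ["s2", "s3", "s1", "s3"]
new_active, new_replica = fail_over(active, replica, "s3")
```

After failover several vBuckets have no replica at all, which is why replacing the node and rebalancing is the recommended follow-up.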

Page 33: Couchbase Data Pipeline

Couchbase at PayPal
Kafka Integration

Page 34: Couchbase Data Pipeline

Couchbase at PayPal

Footprint Overview
• Seven use cases (more going live at later date)
• Each cluster is 10 to 20 nodes
• Three data center locations per use case

Global Cookie Service
• Three clusters (two handle traffic, one for DR)
• Bi-Directional Replication
• Billions of Documents
• TB of Data (Maximum of 10 over time)

Challenge
• Data Analytics

Page 35: Couchbase Data Pipeline

Couchbase at PayPal

Couchbase Solution
• Couchbase Server deployed to capture and serve global cookies
• Integrates with Hadoop to pass data for additional offline analytics via Kafka

Results
• Consistent low latency
  • SLA 10ms application
  • SLA 1ms Couchbase
• High availability enabled by distributed cache and data center replication
• Kafka integration for analytics within Hadoop cluster

Page 36: Couchbase Data Pipeline

© 2015 PayPal Inc. All rights reserved. Confidential and proprietary.

Requirements for Database

Data volume / Scalability
• Online system; >1B documents
• 4-10k size; 5-10TB total storage
• Linearly Scalable

Availability
• Multi data center – DR
• Availability requirement of 99.99%

Data Structure
• Flexible & Schema less; document based

Performance
• 50% read/50% write
• Low latency < 10 msec (5)
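A quick back-of-the-envelope check relates the figures on this slide. It assumes exactly one billion documents as a lower bound for ">1B", reads "4-10k" as 4 to 10 kilobytes per document, uses decimal units, and counts a single copy of the data without replicas or metadata. It also converts the 99.99% availability requirement into a yearly downtime budget.

```python
DOCS = 1_000_000_000                 # ">1B documents" taken as a lower bound
KB = 1000                            # decimal units, an assumption

low_tb = DOCS * 4 * KB / 1e12        # 4 KB per document
high_tb = DOCS * 10 * KB / 1e12      # 10 KB per document

MINUTES_PER_YEAR = 365 * 24 * 60
downtime_minutes = MINUTES_PER_YEAR * (1 - 0.9999)   # allowed by 99.99%

print(f"raw data: {low_tb:.0f} to {high_tb:.0f} TB")
print(f"downtime budget: {downtime_minutes:.1f} minutes per year")
```

Under these assumptions the raw data comes to roughly 4 to 10 TB, in line with the 5-10TB stated on the slide, and 99.99% availability allows a little under an hour of downtime per year.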

Page 37: Couchbase Data Pipeline

Couchbase Kafka Adapter
Based on Couchbase Tap & Kafka Producer

Couchbase TAP
• Snapshot Entire Database
• Export Future mutations
• TAP observes data changes in memcached server

Kafka Producer
• Kafka - A high-throughput distributed messaging system.
• Fast
• Scalable
• Durable
• Distributed

https://github.com/paypal/couchbasekafka
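The adapter's shape, a TAP client feeding a Kafka producer, can be sketched without either system. The code below is not the PayPal adapter and uses no Couchbase or Kafka library. `tap_stream`, `InMemoryProducer`, `run_adapter` and the topic name are hypothetical stand-ins, chosen only to show the snapshot-then-mutations ordering and the forwarding loop.

```python
import json


def tap_stream(snapshot, future_mutations):
    """Yield (key, value) pairs: the full snapshot first, then later mutations."""
    yield from snapshot.items()
    yield from future_mutations


class InMemoryProducer:
    """Hypothetical stand-in for a Kafka producer; it only records messages."""

    def __init__(self):
        self.sent = []

    def send(self, topic, key, value):
        self.sent.append((topic, key, value))


def run_adapter(stream, producer, topic):
    """Forward every change from the stream to the producer as a JSON message."""
    count = 0
    for key, value in stream:
        producer.send(topic, key.encode("utf-8"), json.dumps(value).encode("utf-8"))
        count += 1
    return count


snapshot = {"cookie::1": {"seen": 1}, "cookie::2": {"seen": 4}}
mutations = [("cookie::1", {"seen": 2})]
producer = InMemoryProducer()
forwarded = run_adapter(tap_stream(snapshot, mutations), producer, "cookies")
```

Using the document ID as the message key is a design choice in this sketch. In Kafka, messages with the same key land in the same partition, which keeps successive changes to one document in order.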

Page 38: Couchbase Data Pipeline

Stream data out of database
https://github.com/paypal/couchbasekafka

[Diagram: numbered steps [1] to [7]. Labels shown: TAP Stream; Couchbase Kafka Adapter {TAP Client + Kafka Producer}; Camus, MR Jobs.]

Page 39: Couchbase Data Pipeline

Deployment Model

[Diagram: three CookieApp instances (Write, Read) and Couchbase clusters connected by XDCR. Replication is shown as Bi-directional (Active / Active) and Uni-directional (Active / Passive).]

Page 40: Couchbase Data Pipeline

Demo

Connector … http://blog.couchbase.com/introducing-the-couchbase-kafka-connector

Bits … https://github.com/couchbase/couchbase-kafka-connector

Page 41: Couchbase Data Pipeline

Thank You