Big Data Analytics with Couchbase: Hadoop, Kafka, Spark and More
Matt Ingenthron, Sr. Director
Michael Nitschinger, Software Engineer
Agenda

• Define Problem Domain
• How Couchbase fits in
• Demo
• Q&A
Lambda Architecture

[Diagram: the five numbered stages of the Lambda Architecture: (1) DATA feeds both (2) a BATCH layer and (3) a SPEED layer, whose views are merged by (4) a SERVE layer that answers (5) QUERY requests.]
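To make the query-time merge concrete, here is a minimal Scala sketch, with all names hypothetical: the serving layer combines a complete but stale batch view with a real-time view that covers only the most recent data.

    // Hypothetical sketch of the serve layer's merge step.
    case class View(counts: Map[String, Long])

    def query(batch: View, realtime: View): Map[String, Long] =
      (batch.counts.keySet ++ realtime.counts.keySet).map { k =>
        // Batch covers everything up to the last batch run;
        // realtime covers only what arrived since then.
        k -> (batch.counts.getOrElse(k, 0L) + realtime.counts.getOrElse(k, 0L))
      }.toMap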
Lambda Architecture: Interactive and Real Time Applications

[Diagram: the same five stages with concrete components: HADOOP as the BATCH layer, STORM as the SPEED layer, and COUCHBASE serving the SERVE and QUERY stages.]
[Diagram: Kafka producers publish into a broker cluster; a Storm spout per topic consumes ordered subscriptions, with Couchbase as the destination store.]
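As a sketch of the producer side, using the standard Kafka producer API from Scala (the topic name and payload are made up for illustration):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    // Each event is appended to a topic partition; downstream spouts
    // consume the partitions as ordered subscriptions.
    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("clickstream", "user-123", """{"page":"/home"}"""))
    producer.close()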
COMPLEX EVENT PROCESSING

[Diagram: on the TRACKING and COLLECTION side, REST, filter, and metrics stages feed a BATCH TRACK and a REAL-TIME TRACK; a real-time repository, perpetual store, and analytical DB back business intelligence, monitoring, a chat/voice system, and a dashboard on the ANALYSIS AND VISUALIZATION side.]
Integration at Scale
Requirements for data streaming in modern systems:

• Must support high throughput and low latency
• Need to handle failures and pick up where you left off
• Be efficient about resource usage
Data Sync is the Heart of Any Big Data System

Data sync is a fundamental piece of the architecture:
• It maintains data redundancy for High Availability (HA) and Disaster Recovery (DR), protecting against node, rack, and region failures
• It maintains indexes: indexing is key to building faster access paths to query data (spatial, full-text)
DCP and Couchbase Server Architecture
What is DCP?

DCP is the protocol that drives data sync for Couchbase Server. It:
• Increases data sync efficiency with massive data footprints
• Removes slower disk I/O from the data sync path
• Improves latencies: replication for data durability
• Will, in the future, provide a programmable data sync protocol for external stores outside Couchbase Server
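A conceptual sketch of the "pick up where you left off" idea; the types and fields here are illustrative only, not the actual wire protocol:

    // A consumer remembers, per vBucket, the last sequence number it processed.
    case class StreamState(vbucket: Short, vbucketUuid: Long, lastSeqno: Long)

    sealed trait DcpEvent { def seqno: Long }
    case class Mutation(key: String, value: Array[Byte], seqno: Long) extends DcpEvent
    case class Deletion(key: String, seqno: Long) extends DcpEvent

    // On reconnect it requests the stream starting after lastSeqno, so only
    // the missed mutations are re-sent instead of the whole data set.
    def resumeRange(s: StreamState): (Long, Long) = (s.lastSeqno, Long.MaxValue)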
DCP powers many critical components.
Demo
Shopper Tracking (click stream)

Lightweight analytics:
• Department shopped
• Tech platform
• Click tracks by income

Heavier analytics: develop profiles
And at scale…
Couchbase & Apache Spark
Introduction & Integration
What is Spark?
Apache Spark is a fast and general engine for large-scale data processing.
Spark Components
Spark Core: RDDs, Clustering, Execution, Fault Management
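A minimal RDD example in Scala; the input path is made up, and word count is the canonical demo:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Transformations are lazy; work only runs when an action is called.
    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)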
Spark Components
Spark SQL: Work with structured data, distributed SQL querying
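For example, with the Spark 1.3 API, reusing the sc from the previous sketch (jsonFile infers the schema; the path and fields are made up):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Load JSON as a DataFrame and query it with SQL.
    val people = sqlContext.jsonFile("hdfs:///data/people.json")
    people.registerTempTable("people")
    sqlContext.sql("SELECT name, age FROM people WHERE age >= 21").show()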
Spark Components
Spark Streaming: Build fault-tolerant streaming applications
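A minimal DStream sketch on top of the same sc; a socket source keeps it self-contained, and the host and port are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Micro-batches of one second.
    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Running word count over each batch.
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()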
Spark Components
MLlib: Machine Learning built in
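For instance, k-means clustering with the MLlib 1.3 API, on toy data for illustration:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    // Two clusters, at most ten iterations.
    val model = KMeans.train(points, 2, 10)
    model.clusterCenters.foreach(println)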
Spark Components
GraphX: Graph processing and graph-parallel computations
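A small GraphX example over a toy social graph (names made up):

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))

    // in-degree = number of followers per vertex.
    val graph = Graph(vertices, edges)
    graph.inDegrees.collect().foreach(println)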
Spark Benefits
• Linear scalability
• Ease of use
• Fault tolerance
• For developers and data scientists
• Tight but not mandatory Hadoop integration
Spark Facts
• Current release: 1.3.0
• Over 450 contributors; the most active Apache Big Data project
• Huge public interest:
Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q
Daytona GraySort Performance

                          Hadoop MR Record       Spark Record
Data Size                 102.5 TB               100 TB
Elapsed Time              72 mins                23 mins
# Nodes                   2100                   206
# Cores                   50400 physical         6592 virtual
Cluster Disk Throughput   3150 GB/s              618 GB/s
Network                   Dedicated DC, 10Gbps   EC2, 10Gbps
Sort Rate                 1.42 TB/min            4.27 TB/min
Sort Rate/Node            0.67 GB/min            20.7 GB/min

Source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Benchmark: http://sortbenchmark.org/
How does it work?

RDD Creation → Scheduling (DAG) → Task Execution

Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
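The three phases map directly onto code: transformations only record lineage, and the action at the end triggers DAG scheduling and task execution. A minimal sketch:

    val rdd = sc.parallelize(1 to 1000000)          // RDD creation
    val evens = rdd.filter(_ % 2 == 0).map(_ * 2)   // lineage recorded, nothing runs yet
    val n = evens.count()                           // DAG scheduled, tasks executed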
Spark vs Hadoop
• Spark is RAM-bound while Hadoop is disk (HDFS) bound
• Spark's API is easier to reason about and to develop against
• Spark is fully compatible with Hadoop input/output formats
• Hadoop is more mature; the Spark ecosystem is growing fast
Ecosystem Flexibility
[Diagram: Couchbase exposes DCP, KV, N1QL, and Views interfaces, connecting RDBMSs, streams, web APIs, batching/archived data, and OLTP workloads.]
Infrastructure Consolidation
[Diagram: streams, web APIs, and user interaction consolidated onto a single infrastructure.]
Couchbase Connector
• Spark Core: automatic cluster and resource management; creating and persisting RDDs; Java APIs in addition to Scala (planned before GA)
• Spark SQL: easy JSON handling and querying; tight N1QL integration (dp2)
• Spark Streaming: persisting DStreams; DCP source (planned before GA)
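A sketch of what using the connector looks like, based on the 1.0.0-dp docs; configuration keys and method names may change before GA, and the bucket and document IDs are made up:

    import com.couchbase.client.java.document.JsonDocument
    import com.couchbase.client.java.document.json.JsonObject
    import com.couchbase.spark._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("CouchbaseSample")
      .set("com.couchbase.nodes", "127.0.0.1")
      .set("com.couchbase.bucket.default", "") // bucket name -> password
    val sc = new SparkContext(conf)

    // Fetch documents by ID as an RDD...
    sc.couchbaseGet[JsonDocument](Seq("user::1", "user::2")).collect().foreach(println)

    // ...and persist an RDD of documents back into the bucket.
    sc.parallelize(Seq(JsonDocument.create("user::3", JsonObject.create().put("name", "carol"))))
      .saveToCouchbase()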
Connector Facts
• Current version: 1.0.0-dp
• DP2 upcoming
• GA planned for Q3
Code: https://github.com/couchbaselabs/couchbase-spark-connector
Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki
Questions
Matt Ingenthron @ingenthr
Michael Nitschinger @daschl
Thanks
Additional Slides
Use Case at LinkedIn
Michael Kehoe

• Site Reliability Engineer (SRE) at LinkedIn
• SRE for Profile & Higher-Education
• Member of LinkedIn's CBVT
• B.E. (Electrical Engineering) from the University of Queensland, Australia
Kafka @ LinkedIn

• Kafka was created by LinkedIn
• Kafka is a publish-subscribe system built as a distributed commit log
• Processes 500+ TB/day (~500 billion messages) at LinkedIn
LinkedIn's uses of Kafka

• Monitoring: InGraphs
• Traditional messaging (pub-sub)
• Analytics: Who Viewed My Profile, experiment reports, executive reports
• Building block for distributed (log) applications: Pinot, Espresso
Use Case: Kafka to Hadoop (Analytics)
• LinkedIn tracks data to better understand how members use our products
• Information such as which page got viewed and which content got clicked on is sent into a Kafka cluster in each data center
• Some of these events are centrally collected and pushed onto our Hadoop grid for analysis and daily report generation
Couchbase @ LinkedIn
• About 25 separate services with one or more clusters in multiple data centers
• Up to 100 servers in a cluster
• Single and Multi-tenant clusters
Use Case: Jobs Cluster
• Read scaling: Couchbase at ~80k QPS, 24-server cluster(s)
• Hadoop pre-builds data by partition
• Couchbase 99th-percentile latencies
Hadoop to Couchbase

• Our primary use case for Hadoop with Couchbase is building (warming) / recovering Couchbase buckets
• LinkedIn built its own in-house solution to work with our ETL processes, cache invalidation procedures, etc.