Using Spark over Cassandra


Overview of using Apache Spark over Cassandra to allow fast online queries over big data


REAL TIME ANALYTICS WITH SPARK OVER CASSANDRA

Meetup - Jan 27, 2014

Spark is a distributed open-source framework for real-time data processing over Hadoop. Influenced by Google’s Dremel project, it was developed at UC Berkeley and is now part of the Apache Incubator. Spark introduces a functional MapReduce model for repeated iterations over distributed data.

Intro - What’s Spark?
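A minimal sketch of that iterative, in-memory model (illustrative only; the master URL, path and filter are made up, and the API shown is the Scala API of the Spark 0.8 era): load the data once, cache it in cluster memory, and run repeated passes over it without re-reading from disk.

import org.apache.spark.SparkContext

// Connect to the cluster (master URL and app name are illustrative).
val sc = new SparkContext("spark://master:7077", "spark-intro-demo")

// Load once and pin the dataset in cluster memory...
val events = sc.textFile("hdfs:///data/events").cache()

// ...so that repeated passes reuse the cached RDD instead of re-reading the source.
val counts = (1 to 10).map(day => events.filter(_.contains("day=" + day)).count())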

1. Incoming events digestion (filtering, categorizing, storing). We currently use RabbitMQ and Storm, but Spark could be used here.

2. Batch processing. In our case: attribution per conversion, aggregation per keyword/ad and optimization per campaign. We currently have a proprietary infrastructure that doesn’t scale very well. Spark would shine here.

3. Grids/widgets for online use - slice ‘n dice. We currently have aggregation tables in MySQL. A short demo of what we did here with Spark follows...

Basic problems

We manage billions of keywords. We handle hundreds of millions of clicks and conversions per day. Our clients query the data in lots of varying ways (a sketch of one such query appears below):

● different aggregations, time-periods, filters, sorts

● drill-down to specific items of concern

Problem #3: Grids over billions of cells
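To make the slice ‘n dice concrete, a single grid request of this kind might translate into a time-period filter, a per-keyword aggregation and a sort. The sketch below is hypothetical: the record shape, field names and chosen metric are illustrative, not the production schema.

import org.apache.spark.SparkContext._   // implicit pair-RDD functions (reduceByKey, sortByKey)
import org.apache.spark.rdd.RDD

// Hypothetical keyword-level record.
case class KeywordStats(keyword: String, day: Int, clicks: Long, conversions: Long)

// One grid request: filter by time period, aggregate per keyword, sort, take the first page.
def gridQuery(stats: RDD[KeywordStats], fromDay: Int, toDay: Int): Array[(Long, (String, Long))] =
  stats
    .filter(s => s.day >= fromDay && s.day <= toDay)                   // time-period filter
    .map(s => (s.keyword, (s.clicks, s.conversions)))                  // key by keyword
    .reduceByKey { case ((c1, v1), (c2, v2)) => (c1 + c2, v1 + v2) }   // aggregate per keyword
    .map { case (kw, (clicks, convs)) => (clicks, (kw, convs)) }       // key by the sort metric
    .sortByKey(ascending = false)                                      // sort by clicks, descending
    .take(50)                                                          // first page of the grid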

Architecture for demo

[Diagram: Spark Master alongside a web server; the grid served from an app server; two Spark Worker nodes, each co-located with Cassandra]

Demo...

Code Snippet - setting up an RDD

import java.nio.ByteBuffer
import java.util
// Cassandra Hadoop input-format classes (Cassandra 1.2/2.0-era client jars assumed)
import org.apache.cassandra.db.IColumn
import org.apache.cassandra.hadoop.{ColumnFamilyInputFormat, ConfigHelper}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// sc (the SparkContext), cassandraHost, cassandraPort, keyspaceName, columnFamily and
// the Thrift SlicePredicate `predicate` are assumed to be defined elsewhere.
val job = new Job()
job.setInputFormatClass(classOf[ColumnFamilyInputFormat])
val configuration: Configuration = job.getConfiguration
ConfigHelper.setInputInitialAddress(configuration, cassandraHost)
ConfigHelper.setInputRpcPort(configuration, cassandraPort)
ConfigHelper.setOutputInitialAddress(configuration, cassandraHost)
ConfigHelper.setOutputRpcPort(configuration, cassandraPort)
ConfigHelper.setInputColumnFamily(configuration, keyspaceName, columnFamily)
ConfigHelper.setThriftFramedTransportSizeInMb(configuration, 2047)
ConfigHelper.setThriftMaxMessageLengthInMb(configuration, 2048)
ConfigHelper.setInputSlicePredicate(configuration, predicate)
ConfigHelper.setInputPartitioner(configuration, "Murmur3Partitioner")
ConfigHelper.setOutputPartitioner(configuration, "Murmur3Partitioner")

// Each Cassandra row arrives as a (row key, sorted map of column name -> column) pair.
val casRdd = sc.newAPIHadoopRDD(
  configuration,
  classOf[ColumnFamilyInputFormat],
  classOf[ByteBuffer],
  classOf[util.SortedMap[ByteBuffer, IColumn]])
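The snippet leaves `predicate` undefined; a minimal slice predicate that simply reads every column of each row could look like the following (a sketch using the Thrift classes bundled with Cassandra releases of that period, not necessarily what the demo used):

import org.apache.cassandra.thrift.{SlicePredicate, SliceRange}
import org.apache.cassandra.utils.ByteBufferUtil

// An unbounded slice range: from the first column to the last, up to Int.MaxValue columns per row.
val predicate = new SlicePredicate().setSlice_range(
  new SliceRange(
    ByteBufferUtil.EMPTY_BYTE_BUFFER, // start: no lower bound
    ByteBufferUtil.EMPTY_BYTE_BUFFER, // finish: no upper bound
    false,                            // reversed
    Int.MaxValue))                    // column count limit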

Mappin’ & reducin’ with Spark

// Flatten the raw Cassandra rows into (entity, scores) pairs for the requested
// date range, profile and status filter.
val flatRdd = createFlatRDD(cachedRDD, startDate, endDate, profileId, statusInTarget)

// Group each entity's scores by score name (i.e. by grid column).
val withGroupByScores = flatRdd.map {
  case (entity, performance) =>
    val scores = performance.groupBy(score => score.name)
    (entity, scores)
}

// Reduce each group to a single aggregated Score per column.
val withAggrScores = withGroupByScores.map {
  case (entity, scores) =>
    val aggrScores = scores.map {
      case (column, columnScores) =>
        val aggregation = columnScores.reduce[Score] {
          (left, right) => Score(left.name, left.value + right.value)
        }
        (column, aggregation)
    }
    (entity, aggrScores)
}
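The snippet assumes a `Score` type carrying a metric name and a numeric value, produced by the `createFlatRDD` helper; a minimal, hypothetical shape for it would be:

// Hypothetical shape of the per-column performance metric used above.
case class Score(name: String, value: Double)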

Reading RAM is suddenly a hot-spot...

// Pack (day, column id, value) into one compact byte array to cut per-object
// overhead and keep RDD rows cheap to hold in (and read from) memory.
def createByteArray(date: String, column: Column, value: ByteBuffer): Array[Byte] = {
  val daysFromEpoch = calcDaysFromEpoch(date)
  val columnOrdinal = column.id
  val buffer = ByteBuffer.allocate(4 + 4 + value.remaining())
  buffer.putInt(daysFromEpoch)   // 4 bytes: days since epoch
  buffer.putInt(columnOrdinal)   // 4 bytes: column ordinal
  buffer.put(value)              // remaining bytes: the raw value
  buffer.array()
}
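For illustration, reading such an array back is the mirror image of the packing above (a sketch under the same assumptions, not taken from the deck):

import java.nio.ByteBuffer

// Reverse of createByteArray: recover the day, the column ordinal and the raw value bytes.
def readByteArray(bytes: Array[Byte]): (Int, Int, Array[Byte]) = {
  val buffer = ByteBuffer.wrap(bytes)
  val daysFromEpoch = buffer.getInt
  val columnOrdinal = buffer.getInt
  val value = new Array[Byte](buffer.remaining())
  buffer.get(value)
  (daysFromEpoch, columnOrdinal, value)
}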

● For this demo: EC2 cluster of a Master and 2 Slave nodes
● Each Slave with: 240 GB memory, 32 cores, SSD drives, 10 Gb network
● Data size: 100 GB
● Cassandra 2.1
● Spark 0.8.0
● Rule of thumb for cost estimation: ~$25K / TB of data
● You’ll probably need 2x the memory, as RDDs are immutable
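Back-of-the-envelope, using the figures above: 100 GB of data with the 2x immutable-RDD overhead calls for roughly 200 GB of cluster RAM, which the two 240 GB slaves cover with headroom; and at ~$25K per TB, a 100 GB dataset works out to roughly $2.5K.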

White Hat* - facts

* Colored hats metaphor taken from de Bono’s “Six Thinking Hats”

Yellow Hat - optimism

● Full slice ‘n dice over all data with acceptable latency for online use (< 5 seconds)
● Additional aggregations at no extra performance cost
● Ease of setup (but as always, be prepared for some tinkering)
● Single source of truth
● Horizontal scale
● MapReduce capabilities for machine learning algorithms
● Enables merging recent data with old data (what Nathan Marz coined the “lambda architecture”)

Black Hat - concerns

● System stability
● Changing API
● Limited ecosystem
● Scala-based code - learning curve
● Maintainability: optimal speed means a low level of abstraction
● Data duplication, especially in transit
● Master node is a single point of failure
● Scheduling

Green Hat - alternatives

● Alternatives to Spark:
  ○ Cloudera’s Impala (commercial product)
  ○ Presto (recently open-sourced out of Facebook)
  ○ Trident/Storm (for stream processing)

Red Hat - emotions, intuitions

Spark’s technology is nothing short of astonishing: yesterday’s “impossible!” is today’s “merely difficult...”


THANK YOU (AND YES, WE ARE HIRING!)

NOAM.BARCAY@KENSHOO.COM
