Spark with Cassandra by Christopher Batey



Christopher Batey - @chbatey

Freelance software engineer / DevOps / architect

Loves:

Breaking software

Distributed systems

Hates:

Fragile software

Untested code :(

Introduction

Audience

Assumption is…

Overview

Cassandra architecture

Modelling time series - weather data for many stations

What can be done with pure C*

When to introduce Spark

What do we use Spark for?

Batch processing

Machine Learning

Ad-hoc querying of large datasets

Stream processing

What do we use Cassandra for?

Operational Database

OLTP

Cassandra overview


Master/slave

(Diagram: a master node asynchronously replicating to slave nodes)


Sharding


The other way


Consistent hashing

jim age: 36 car: ford gender: M

carol age: 37 car: bmw gender: F

johnny age: 12 gender: M

suzy age: 10 gender: F

Partition key   Hash value
jim             350
carol           998
johnny          50
suzy            600

(Diagram: a hash ring from 0 to 999 divided into four token ranges, one per node A, B, C and D)

Example

Node   Start range   End range   Primary key   Hash value
A      0             249         johnny        50
B      250           499         jim           350
C      500           749         suzy          600
D      750           999         carol         998
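As a rough sketch of that mapping (toy 0-999 hash values for illustration, not Cassandra's actual Murmur3 tokens), the primary replica for a key is simply the node whose range contains the key's hash:

// Toy model of the table above: four nodes, each owning a contiguous slice of a 0-999 hash space.
val ranges = Seq("A" -> (0 to 249), "B" -> (250 to 499), "C" -> (500 to 749), "D" -> (750 to 999))
val hashes = Seq("johnny" -> 50, "jim" -> 350, "suzy" -> 600, "carol" -> 998)

hashes.foreach { case (key, hash) =>
  val node = ranges.collectFirst { case (name, range) if range.contains(hash) => name }.get
  println(s"$key (hash $hash) -> primary replica on node $node")
}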


Fault tolerance

Replicate each piece of data on multiple nodes

Keep replicas on different racks

Datacenter aware

(Diagram: a client writes to DC1 at consistency level ONE; with RF=3 in each datacenter the write is replicated within DC1 and asynchronously to DC2: "We have replication!")

Storing weather data

CREATE TABLE raw_weather_data (
  weather_station text,
  year int,
  month int,
  day int,
  hour int,
  temp double,
  PRIMARY KEY ((weather_station), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);


Primary key relationship

PRIMARY KEY ((weatherstation_id), year, month, day, hour)

Data locality

weatherstation_id = '10010:99999' ?

(Diagram: a 1000-node cluster; the partition for this station lives on a single node: "You are here!")

Primary key relationship

PRIMARY KEY ((weatherstation_id), year, month, day, hour)

Partition key: 10010:99999

Clustering columns (one wide row, stored together on the same node):

2005:12:1:10:temp   -5.3
2005:12:1:9:temp    -4.9
2005:12:1:8:temp    -5.6
2005:12:1:7:temp    -5.1

I have a question!!

What happens if I want to do an ad-hoc query??

I’ve stored the data partitioned by weather id…

… now I want a report for all stations

I’ve stored the raw weather data…

… now I want rollups/aggregates

Analytics Workload Isolation

Deployment

- Spark worker on each of the Cassandra nodes

- Partitions made up of LOCAL Cassandra data

(Diagram: four nodes, each running a Spark worker (S) alongside Cassandra (C))
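A minimal sketch of pointing a Spark job at the co-located cluster with the DataStax spark-cassandra-connector; the application name and contact point are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Give the connector one reachable Cassandra contact point; it discovers the rest of the ring.
// With Spark workers co-located on the Cassandra nodes, tasks can then prefer node-local data.
val conf = new SparkConf()
  .setAppName("weather-analytics")                      // placeholder app name
  .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder contact point

val sc = new SparkContext(conf)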

Cassandra Data is Distributed By Token Range

(Diagrams: the token ring divided across Node 1, Node 2, Node 3 and Node 4. Without vnodes each node owns a single contiguous token range; with vnodes each node owns many smaller, non-contiguous ranges.)

Cassandra RDD

Each Spark partition is made up of token ranges that live on the same node

Each Spark partition is made up of Cassandra partitions that are on the same node
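For example (a sketch assuming the spark-cassandra-connector and a hypothetical isd_weather_data keyspace), the table becomes an RDD whose partitions are built from node-local token ranges:

import com.datastax.spark.connector._   // brings sc.cassandraTable into scope

// Each Spark partition groups token ranges (and therefore Cassandra partitions) held by one node,
// so tasks can be scheduled on the node that already stores the data.
val raw = sc.cassandraTable("isd_weather_data", "raw_weather_data")   // keyspace name is assumed
raw.take(5).foreach(println)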

Storing weather data

CREATE TABLE raw_weather_data (
  weather_station text,
  year int,
  month int,
  day int,
  hour int,
  temp double,
  PRIMARY KEY ((weather_station), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Partition key = single node:

(count: 24, mean: 14.428150, stdev: 7.092196, max: 28.034969, min: 0.675863)

No partition key = every node:

(count: 11242, mean: 8.921956, stdev: 7.428311, max: 29.997986, min: -2.200000)
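Those numbers are Spark StatCounter output. A hedged sketch of both runs with the connector (the keyspace name is again a placeholder):

import com.datastax.spark.connector._

// Single partition: the where clause is pushed down to Cassandra, so only one node is touched.
val oneStation = sc.cassandraTable("isd_weather_data", "raw_weather_data")
  .select("temp")
  .where("weather_station = ?", "10010:99999")
  .map(_.getDouble("temp"))
println(oneStation.stats())   // count, mean, stdev, max, min

// No partition key: a full table scan that reads from every node in the cluster.
val allStations = sc.cassandraTable("isd_weather_data", "raw_weather_data")
  .select("temp")
  .map(_.getDouble("temp"))
println(allStations.stats())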

Not quick enough?

daily_aggregate_precip

CREATE TABLE daily_aggregate_precip (
  weather_station text,
  year int,
  month int,
  day int,
  precipitation counter,
  PRIMARY KEY ((weather_station), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

SELECT precipitation FROM daily_aggregate_precip
WHERE weather_station = '010010:99999' AND year = 2005 AND month = 12 AND day >= 1 AND day <= 7;

Weather station info

725030:14732,2008,01,01,00,5.0,-3.9,1020.4,270,4.6,2,0.0
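A sketch of parsing that comma-separated record into a case class; the meaning of the trailing fields (dewpoint, pressure, wind, sky condition, one-hour precipitation) is an assumption about this feed:

// Assumed field order: station, year, month, day, hour, temp, dewpoint, pressure,
// wind direction, wind speed, sky condition, one-hour precipitation.
case class RawWeatherData(
  weatherStation: String, year: Int, month: Int, day: Int, hour: Int,
  temp: Double, oneHourPrecip: Double)

def parse(line: String): RawWeatherData = {
  val f = line.split(",")
  RawWeatherData(f(0), f(1).toInt, f(2).toInt, f(3).toInt, f(4).toInt,
    f(5).toDouble, f(11).toDouble)
}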

Creating a Stream
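A minimal sketch of the stream, using a socket text source for simplicity (a real deployment may well use Kafka instead); host and port are placeholders:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))        // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
val readings = lines.map(parse)                       // DStream[RawWeatherData]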

Saving the raw data
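Building on the parse sketch above, each micro-batch can be written straight into raw_weather_data; the keyspace name is a placeholder:

import com.datastax.spark.connector._              // SomeColumns
import com.datastax.spark.connector.streaming._    // saveToCassandra on DStreams

readings
  .map(r => (r.weatherStation, r.year, r.month, r.day, r.hour, r.temp))
  .saveToCassandra("isd_weather_data", "raw_weather_data",
    SomeColumns("weather_station", "year", "month", "day", "hour", "temp"))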

Building an aggregate

CREATE TABLE daily_aggregate_precip (
  weather_station text,
  year int,
  month int,
  day int,
  precipitation counter,
  PRIMARY KEY ((weather_station), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

CQL Counter
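A sketch of keeping the counter table up to date from the same stream: sum precipitation per (station, day) within each batch and write it to the counter column, which the connector should apply as counter increments (the rounding to whole units is an assumption, since counters store integers):

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

readings
  .map(r => ((r.weatherStation, r.year, r.month, r.day), r.oneHourPrecip))
  .reduceByKey(_ + _)                                  // rainfall per station per day, per batch
  .map { case ((station, y, m, d), precip) =>
    (station, y, m, d, math.round(precip))             // counter columns hold integers
  }
  .saveToCassandra("isd_weather_data", "daily_aggregate_precip",
    SomeColumns("weather_station", "year", "month", "day", "precipitation"))

ssc.start()
ssc.awaitTermination()

Each batch then increments the running total for that day, which is what the earlier SELECT on daily_aggregate_precip reads back cheaply.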

Want more Spark/C* goodness?

@helenaedelson

Conclusion

Cassandra = an OLTP database for large-scale workloads

Spark can be used to do complex queries in a partition

Or analytical queries for an entire table

Spark streaming to keep tables up to date

Thanks for listening

Questions later? @chbatey