5 Ways to Use Spark to Enrich Your Cassandra Environment

Jim Hatcher
DFW Cassandra Users Meetup
5/18/2016
Agenda
• Introduction
• Data Systems
• What is Cassandra?
• Tradeoffs vs. RDBMS
• Addressing Limitations
• What is Spark?
• Five Ways to Use Spark in a Cassandra Environment
  • ETL
  • Data Migrations
  • Consistency Checking / Syncing Denormalized Data
  • Analytics
  • Machine Learning
• Spark Resources
Introduction
At IHS, we take raw data and turn it into information and insights for our customers.
• Automotive Systems (CarFax)
• Defense Systems (Jane's)
• Oil & Gas Systems (Petra)
• Maritime Systems
• Technology Systems (Electronic Parts Database, Root Metrics)

[Diagram: Sources of Raw Data → Structure Data / Add Value → Customer-facing Systems]
Data Systems

[Diagram: a 2x2 landscape of data systems along two axes, Operational vs. Analytical and Conventional Scale vs. Big Data]
• Big Data / Analytical: Big Data Analytics (Hadoop, MapReduce, Hive, Spark)
• Big Data / Operational: NoSQL (Cassandra, HBase, MongoDB)
• Conventional Scale / Analytical: Data Warehousing (SQL Server, Oracle, SAS, Tableau)
• Conventional Scale / Operational: Relational Databases (SQL Server, Oracle, DB2)

Analytical side: batch processing (minutes to hours), range queries, visualization / dashboards.
Operational side: real-time processing (milliseconds), discrete seeks/updates, line-of-business apps.
Big Data row: "commodity" hardware, scale out.
Conventional Scale row: large servers using shared storage, scale up.
Data Systems

Factors:
• Size/Scale of Data
• Multi-Data Center (with writes)
• Rate of Data Ingest
• Massive Concurrency
• Uptime Requirements
• Operational Complexity
What is Cassandra?
[Diagram: a six-node Cassandra cluster (nodes A through F) arranged in a ring. The token space from -9223372036854775808 through 9223372036854775807 is split into six contiguous ranges, one owned by each node, and a client can connect to any node in the ring.]
CREATE KEYSPACE orders
WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };

CREATE TABLE orders.customer (
  customer_id uuid,
  customer_name varchar,
  customer_age int,
  PRIMARY KEY ( customer_id )
);

INSERT INTO customer (customer_id, customer_name, customer_age)
VALUES (feb2b9e6-613b-4e6b-b470-981dc4d42525, 'Bob', 35);

SELECT customer_name, customer_age
FROM customer
WHERE customer_id = feb2b9e6-613b-4e6b-b470-981dc4d42525;
What is Cassandra?
Cassandra is:
• A NoSQL (i.e., non-relational) operational database
• Distributed (the data lives on many nodes)
• Highly Scalable (no scale ceiling)
• Highly Available (no single point of failure)
• Open Source
• Fast (optimized for fast reads and fast writes)

Cassandra uses:
• Commodity Hardware (no SAN/NAS or high-end hardware)
• Ring Architecture (not master/slave)
• Flexible Data Model
• CQL (an abstraction layer for data access; not tied to a particular language)

DataStax provides consulting, support, and additional software around Cassandra.
Tradeoffs (vs. RDBMS)

What you gain with Cassandra:
• Linear Horizontal Scale (HUGE!)
• Multi-Data Center / Active-Active
• Fast, Scalable Writes
• Fast Reads (by the key(s))
• Continuous Availability
• High Concurrency
• Schema Flexibility
• Cheaper (commodity hardware)?

What you give up with Cassandra:
• Tables only queryable by key
• 3rd Normal Form
• Data Integrity Checks
• Foreign Keys
• Unique Indexes
• Joins
• Secondary Indexes
• Grouping / Aggregation
• ACID
Addressing Limitations

Limitation → Solution(s)
• Tables only queryable by key → Denormalize Data; Index in Another Tool
• No Foreign Keys / Unique Indexes → Idempotent Data Model; Consistency Checker
• No JOINs → Denormalize Data; Batch Analytics
• No GROUP BYs / Aggregation → Batch Analytics
• Keeping Denormalized Data in Sync → Consistency Checker
• Creating New Tables for New Queries → Batch ETL
What is Spark?

Spark is a processing framework designed to work with distributed data.

"Up to 100X faster than MapReduce," according to spark.apache.org

Used in any ecosystem where you want to work with distributed data (Hadoop, Cassandra, etc.)

Includes other specialized libraries:
• Spark SQL
• Spark Streaming
• MLlib
• GraphX
Spark Facts

Conceptually Similar To: MapReduce
Written In: Scala
Supported By: Databricks
Supported Languages: Scala, Java, or Python
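To make that concrete, here is the canonical word-count example, a minimal sketch to run in the Spark shell (the input path is a placeholder):

val lines = sc.textFile("/some/input.txt")   // read a text file as an RDD of lines
val counts = lines
  .flatMap(line => line.split(" "))          // split each line into words
  .map(word => (word, 1))                    // pair each word with a count of 1
  .reduceByKey(_ + _)                        // sum the counts per word across the cluster
counts.take(10).foreach(println)             // print ten (word, count) pairs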
Spark Architecture

[Diagram: a Spark Client runs the Driver, which creates a Spark Context.
1. The Spark Context requests resources from the Spark Master.
2. The Spark Master allocates resources on the Spark Workers.
3. The Spark Workers start Executors.
4. The Executors perform the computation.]
Credit: https://academy.datastax.com/courses/ds320-analytics-apache-spark/introduction-spark-architecture
Spark Terms / Concepts

Resilient Distributed Dataset (RDD)
An immutable, partitioned collection of elements that can be operated on in parallel.

DataFrame
An RDD plus a schema. This is the "way that everything in Spark is going."

Actions and Transformations
• Transformations create a new RDD but are executed lazily (i.e., only when an action fires).
• Actions cause a computation to be run and return a response to the driver program.
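A minimal sketch of the lazy model, run in the Spark shell (where sc is already provided):

val numbers = sc.parallelize(1 to 1000)   // create an RDD from a local collection
val evens = numbers.filter(_ % 2 == 0)    // transformation: recorded in the lineage, nothing executes yet
val total = evens.count()                 // action: triggers the distributed computation and returns 500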
Executing Spark Code
• Spark Shell: run Spark commands interactively via the Spark REPL.
• Spark Submit: execute Spark jobs (i.e., JAR files); you can build a JAR file in the Java IDE of your choice (Eclipse, IntelliJ, etc.).
Spark with Cassandra
Credit: https://academy.datastax.com/courses/ds320-analytics-apache-spark/introduction-spark-architecture
[Diagram: a three-node Cassandra cluster (A, B, C) with a Spark Worker co-located on each node; a Spark Master coordinates the workers, and a Spark Client submits jobs to the Master.]
Spark Cassandra Connector: open source, supported by DataStax
https://github.com/datastax/spark-cassandra-connector
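Once the connector is on the classpath, a single import enriches the standard Spark types with Cassandra methods; a minimal sketch (keyspace and table names are placeholders):

import com.datastax.spark.connector._

val rows = sc.cassandraTable("my_keyspace", "my_table")   // SparkContext gains cassandraTable(...)
rows.saveToCassandra("my_keyspace", "my_table_copy")      // RDDs gain saveToCassandra(...)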
ETL (Extract, Transform, Load)

[Diagram: data sources (text file, JDBC data source, Cassandra, Hadoop) feed a three-step Spark pipeline: Extract Data (Spark: create an RDD) → Transform Data (Spark: map function) → Load Data (Spark: save to Cassandra).]
ETL

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Create a SparkConf and a SparkContext
val sparkConf = new SparkConf(true)
  .setAppName("MyEtlApp")
  .setMaster("spark://10.1.1.1:7077")
  .set("spark.cassandra.connection.host", "10.2.2.2")
val sc = new SparkContext(sparkConf)

// EXTRACT: using the SparkContext, read a text file and expose it as an RDD
val logfile = sc.textFile("/weblog.csv")

// TRANSFORM: split each CSV line into fields, then put the fields into a tuple
val split = logfile.map { line => line.split(",") }
val transformed = split.map { record => ( record(0), record(1) ) }

// LOAD: write the tuple structure into Cassandra
transformed.saveToCassandra("test", "weblog")
Data Migrations

[Diagram: the same three-step Spark pipeline with Cassandra as both source and target: Extract Data (Spark: create an RDD from a Cassandra table) → Transform Data (Spark: map function) → Load Data (Spark: save to a different Cassandra table).]
Data Migrations

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Create a SparkConf and a SparkContext
val sparkConf = new SparkConf(true)
  .setAppName("MyEtlApp")
  .setMaster("spark://10.1.1.1:7077")
  .set("spark.cassandra.connection.host", "10.2.2.2")
val sc = new SparkContext(sparkConf)

// EXTRACT: using the SparkContext, read a C* table and expose it as an RDD
val weblogRecords = sc.cassandraTable("test", "weblog").select("logtime", "page")

// TRANSFORM: pull fields out of each CassandraRow and put them into a tuple
val transformed = weblogRecords.map { row => ( row.getString(1), row.getLong(0) ) }

// LOAD: write the tuple structure into a different Cassandra table
transformed.saveToCassandra("test", "weblog_bypage")
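If the tuple order does not match the target table's column order, the connector can also map tuple elements to named columns explicitly; a sketch, assuming the column names of the weblog_bypage table:

import com.datastax.spark.connector._

// map the first tuple element to "page" and the second to "logtime"
transformed.saveToCassandra("test", "weblog_bypage", SomeColumns("page", "logtime"))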
Consistency Checking / Syncing Denormalized Data

[Diagram: a Cassandra base table and its denormalized tables (Denormalized Table 1, Denormalized Table 2) feed the pipeline: Extract Data (Spark: create an RDD of missing records) → Transform Data (Spark: map function) → Load Data (Spark: save the missing records to Cassandra).]
Consistency Checking / Syncing Denormalized Data
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)

// Find base-table rows that are missing from the denormalized table
val query1 = """
  SELECT w1.logtime, w1.page
  FROM test.weblog w1
  LEFT JOIN test.weblog_bypage w2 ON w1.page = w2.page
  WHERE w2.page IS NULL"""

val results1 = hc.sql(query1)
results1.collect.foreach(println)

// Add a record to the base table only, then re-run the check to see it reported as missing
val newRecord = Array(("2016-05-17 2:00:00", "page6.html"))
val newRecordRdd = sc.parallelize(newRecord)
newRecordRdd.saveToCassandra("test", "weblog")
results1.collect.foreach(println)

// Sync: transform the missing records and save them into the denormalized table
val transformed = results1.map { row => ( row.getString(1), row.get(0) ) }
transformed.saveToCassandra("test", "weblog_bypage")
Analytics
// EXAMPLE of a JOIN
val query2 = """
  SELECT w.page, w.logtime, p.owner
  FROM test.weblog w
  INNER JOIN test.webpage p ON w.page = p.page"""
val results2 = hc.sql(query2)
results2.collect.foreach(println)

// EXAMPLE of a GROUP BY
val query3 = """
  SELECT w.page, COUNT(*) AS RecordCount
  FROM test.weblog w
  GROUP BY w.page
  ORDER BY w.page"""
val results3 = hc.sql(query3)
results3.collect.foreach(println)
Machine Learning

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

case class LabeledDocument(id: Long, text: String, label: Double)
case class DataDocument(id: Long, text: String)

lazy val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Load the training data
val modelTrainingRecords = sc.cassandraTable("test", "ml_training")
  .select("id", "text", "label")
val labeledDocuments = modelTrainingRecords.map { record =>
  LabeledDocument(record.getLong("id"), record.getString("text"), record.getDouble("label"))
}.toDF
Machine Learning

// Create the pipeline
val pipeline = {
  val tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words")

  val hashingTF = new HashingTF()
    .setNumFeatures(1000)
    .setInputCol(tokenizer.getOutputCol)
    .setOutputCol("features")

  val lr = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(0.001)

  new Pipeline()
    .setStages(Array(tokenizer, hashingTF, lr))
}

// Fit the pipeline to training documents.
val model = pipeline.fit(labeledDocuments)
Machine Learning

// Load the data to run against the model
val modelTestRecords = sc.cassandraTable("test", "ml_text")
val dataDocuments = modelTestRecords.map { record =>
  DataDocument(record.getLong(0), record.getString(1))
}.toDF

// Score the documents and print the probability and predicted label for each
model.transform(dataDocuments)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }
Resources

Spark
• Books: http://shop.oreilly.com/product/0636920028512.do

Scala (knowing Scala will really help you progress in Spark)
• Functional Programming Principles in Scala (videos): https://www.youtube.com/user/afigfigueira/playlists?shelf_id=9&view=50&sort=dd
• Books: http://www.scala-lang.org/documentation/books.html

Spark and Cassandra
• DataStax Academy: http://academy.datastax.com/
  • Self-paced course: DS320: DataStax Enterprise Analytics with Apache Spark (really good!)
  • Tutorials
• Spark Cassandra Connector website (lots of good examples): https://github.com/datastax/spark-cassandra-connector