5 Ways to Use Spark to Enrich Your Cassandra Environment

Jim Hatcher
DFW Cassandra Users Meetup
5/18/2016
Agenda
• Introduction
• Data Systems
• What is Cassandra?
• Tradeoffs vs. RDBMS
• Addressing Limitations
• What is Spark?
• Five Ways to Use Spark in a Cassandra Environment
  • ETL
  • Data Migrations
  • Consistency Checking / Syncing Denormalized Data
  • Analytics
  • Machine Learning
• Spark Resources
Introduction
At IHS, we take raw data and turn it into information and insights for our customers.
• Automotive Systems (CarFax)
• Defense Systems (Jane's)
• Oil & Gas Systems (Petra)
• Maritime Systems
• Technology Systems (Electronic Parts Database, Root Metrics)

[Diagram: Sources of Raw Data → Structure Data / Add Value → Customer-facing Systems]
Data Systems

[Diagram: a 2x2 landscape of data systems along two axes, Operational vs. Analytical and Conventional Scale vs. Big Data]
• Big Data / Analytical: Big Data Analytics (Hadoop, MapReduce, Hive, Spark)
• Big Data / Operational: NoSQL (Cassandra, HBase, MongoDB)
• Conventional Scale / Analytical: Data Warehousing (SQL Server, Oracle, SAS, Tableau)
• Conventional Scale / Operational: Relational Databases (SQL Server, Oracle, DB2)

Analytical side: batch processing (minutes to hours), range queries, visualization / dashboards.
Operational side: real-time processing (milliseconds), discrete seeks/updates, line-of-business apps.
Big Data row: "commodity" hardware, scale out.
Conventional Scale row: large servers using shared storage, scale up.
Data Systems

Factors:
• Size/Scale of Data
• Multi-Data Center (with writes)
• Rate of Data Ingest
• Massive Concurrency
• Uptime Requirements
• Operational Complexity
What is Cassandra?
[Diagram: a six-node Cassandra cluster (nodes A through F) arranged in a ring. The token space from -9223372036854775808 through 9223372036854775807 is split into six contiguous ranges, one owned by each node, and a client can connect to any node in the ring.]
CREATE KEYSPACE orders
WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };

CREATE TABLE orders.customer (
  customer_id uuid,
  customer_name varchar,
  customer_age int,
  PRIMARY KEY ( customer_id )
);

INSERT INTO customer (customer_id, customer_name, customer_age)
VALUES (feb2b9e6-613b-4e6b-b470-981dc4d42525, 'Bob', 35);

SELECT customer_name, customer_age
FROM customer
WHERE customer_id = feb2b9e6-613b-4e6b-b470-981dc4d42525;
What is Cassandra?
Cassandra is:
• A NoSQL (i.e., non-relational) operational database
• Distributed (the data lives on many nodes)
• Highly Scalable (no scale ceiling)
• Highly Available (no single point of failure)
• Open Source
• Fast (optimized for fast reads and fast writes)

Cassandra uses:
• Commodity Hardware (no SAN/NAS or high-end hardware)
• Ring Architecture (not master/slave)
• Flexible Data Model
• CQL (an abstraction layer for data access; not tied to a particular language)

DataStax provides consulting, support, and additional software around Cassandra.
Tradeoffs (vs. RDBMS)

What you gain with Cassandra:
• Linear Horizontal Scale (HUGE!)
• Multi-Data Center / Active-Active
• Fast, Scalable Writes
• Fast Reads (by the key(s))
• Continuous Availability
• High Concurrency
• Schema Flexibility
• Cheaper (commodity hardware)?

What you give up with Cassandra:
• Tables only queryable by key
• 3rd Normal Form
• Data Integrity Checks
• Foreign Keys
• Unique Indexes
• Joins
• Secondary Indexes
• Grouping / Aggregation
• ACID
Addressing Limitations

Limitation → Solution(s)
• Tables only queryable by key → Denormalize Data; Index in Another Tool
• No Foreign Keys / Unique Indexes → Idempotent Data Model; Consistency Checker
• No JOINs → Denormalize Data; Batch Analytics
• No GROUP BYs / Aggregation → Batch Analytics
• Keeping Denormalized Data in Sync → Consistency Checker
• Creating New Tables for New Queries → Batch ETL
What is Spark?

Spark is a processing framework designed to work with distributed data.

"Up to 100X faster than MapReduce," according to spark.apache.org

Used in any ecosystem where you want to work with distributed data (Hadoop, Cassandra, etc.)

Includes other specialized libraries:
• Spark SQL
• Spark Streaming
• MLlib
• GraphX
Spark Facts

Conceptually Similar To: MapReduce
Written In: Scala
Supported By: Databricks
Supported Languages: Scala, Java, or Python
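To make that concrete, here is the canonical word-count example, a minimal sketch to run in the Spark shell (the input path is a placeholder):

val lines = sc.textFile("/some/input.txt")   // read a text file as an RDD of lines
val counts = lines
  .flatMap(line => line.split(" "))          // split each line into words
  .map(word => (word, 1))                    // pair each word with a count of 1
  .reduceByKey(_ + _)                        // sum the counts per word across the cluster
counts.take(10).foreach(println)             // print ten (word, count) pairs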
Spark Architecture

[Diagram: a Spark Client runs the Driver, which creates a Spark Context.
1. The Spark Context requests resources from the Spark Master.
2. The Spark Master allocates resources on the Spark Workers.
3. The Spark Workers start Executors.
4. The Executors perform the computation.]
Credit: https://academy.datastax.com/courses/ds320-analytics-apache-spark/introduction-spark-architecture
Spark Terms / Concepts

Resilient Distributed Dataset (RDD)
An immutable, partitioned collection of elements that can be operated on in parallel.

DataFrame
An RDD plus a schema. This is the "way that everything in Spark is going."

Actions and Transformations
• Transformations create a new RDD but are executed lazily (i.e., only when an action fires).
• Actions cause a computation to be run and return a response to the driver program.
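A minimal sketch of the lazy model, run in the Spark shell (where sc is already provided):

val numbers = sc.parallelize(1 to 1000)   // create an RDD from a local collection
val evens = numbers.filter(_ % 2 == 0)    // transformation: recorded in the lineage, nothing executes yet
val total = evens.count()                 // action: triggers the distributed computation and returns 500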
Executing Spark Code
• Spark Shell: run Spark commands interactively via the Spark REPL.
• Spark Submit: execute Spark jobs (i.e., JAR files); you can build a JAR file in the Java IDE of your choice (Eclipse, IntelliJ, etc.).
Spark with Cassandra
Credit: https://academy.datastax.com/courses/ds320-analytics-apache-spark/introduction-spark-architecture
[Diagram: a three-node Cassandra cluster (A, B, C) with a Spark Worker co-located on each node; a Spark Master coordinates the workers, and a Spark Client submits jobs to the Master.]
Spark Cassandra Connector: open source, supported by DataStax
https://github.com/datastax/spark-cassandra-connector
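Once the connector is on the classpath, a single import enriches the standard Spark types with Cassandra methods; a minimal sketch (keyspace and table names are placeholders):

import com.datastax.spark.connector._

val rows = sc.cassandraTable("my_keyspace", "my_table")   // SparkContext gains cassandraTable(...)
rows.saveToCassandra("my_keyspace", "my_table_copy")      // RDDs gain saveToCassandra(...)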
ETL (Extract, Transform, Load)

[Diagram: data sources (text file, JDBC data source, Cassandra, Hadoop) feed a three-step Spark pipeline: Extract Data (Spark: create an RDD) → Transform Data (Spark: map function) → Load Data (Spark: save to Cassandra).]
ETL

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Create a SparkConf and a SparkContext
val sparkConf = new SparkConf(true)
  .setAppName("MyEtlApp")
  .setMaster("spark://10.1.1.1:7077")
  .set("spark.cassandra.connection.host", "10.2.2.2")
val sc = new SparkContext(sparkConf)

// EXTRACT: using the SparkContext, read a text file and expose it as an RDD
val logfile = sc.textFile("/weblog.csv")

// TRANSFORM: split each CSV line into fields, then put the fields into a tuple
val split = logfile.map { line => line.split(",") }
val transformed = split.map { record => ( record(0), record(1) ) }

// LOAD: write the tuple structure into Cassandra
transformed.saveToCassandra("test", "weblog")
Data Migrations

[Diagram: the same three-step Spark pipeline with Cassandra as both source and target: Extract Data (Spark: create an RDD from a Cassandra table) → Transform Data (Spark: map function) → Load Data (Spark: save to a different Cassandra table).]
Data Migrations

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Create a SparkConf and a SparkContext
val sparkConf = new SparkConf(true)
  .setAppName("MyEtlApp")
  .setMaster("spark://10.1.1.1:7077")
  .set("spark.cassandra.connection.host", "10.2.2.2")
val sc = new SparkContext(sparkConf)

// EXTRACT: using the SparkContext, read a C* table and expose it as an RDD
val weblogRecords = sc.cassandraTable("test", "weblog").select("logtime", "page")

// TRANSFORM: pull fields out of each CassandraRow and put them into a tuple
val transformed = weblogRecords.map { row => ( row.getString(1), row.getLong(0) ) }

// LOAD: write the tuple structure into a different Cassandra table
transformed.saveToCassandra("test", "weblog_bypage")
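If the tuple order does not match the target table's column order, the connector can also map tuple elements to named columns explicitly; a sketch, assuming the column names of the weblog_bypage table:

import com.datastax.spark.connector._

// map the first tuple element to "page" and the second to "logtime"
transformed.saveToCassandra("test", "weblog_bypage", SomeColumns("page", "logtime"))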
Consistency Checking / Syncing Denormalized Data

[Diagram: a Cassandra base table and its denormalized tables (Denormalized Table 1, Denormalized Table 2) feed the pipeline: Extract Data (Spark: create an RDD of missing records) → Transform Data (Spark: map function) → Load Data (Spark: save the missing records to Cassandra).]
Consistency Checking / Syncing Denormalized Data
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)

// Find base-table rows that are missing from the denormalized table
val query1 = """
  SELECT w1.logtime, w1.page
  FROM test.weblog w1
  LEFT JOIN test.weblog_bypage w2 ON w1.page = w2.page
  WHERE w2.page IS NULL"""

val results1 = hc.sql(query1)
results1.collect.foreach(println)

// Add a record to the base table only, then re-run the check to see it reported as missing
val newRecord = Array(("2016-05-17 2:00:00", "page6.html"))
val newRecordRdd = sc.parallelize(newRecord)
newRecordRdd.saveToCassandra("test", "weblog")
results1.collect.foreach(println)

// Sync: transform the missing records and save them into the denormalized table
val transformed = results1.map { row => ( row.getString(1), row.get(0) ) }
transformed.saveToCassandra("test", "weblog_bypage")
Analytics
// EXAMPLE of a JOIN
val query2 = """
  SELECT w.page, w.logtime, p.owner
  FROM test.weblog w
  INNER JOIN test.webpage p ON w.page = p.page"""
val results2 = hc.sql(query2)
results2.collect.foreach(println)

// EXAMPLE of a GROUP BY
val query3 = """
  SELECT w.page, COUNT(*) AS RecordCount
  FROM test.weblog w
  GROUP BY w.page
  ORDER BY w.page"""
val results3 = hc.sql(query3)
results3.collect.foreach(println)
Machine Learning

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

case class LabeledDocument(id: Long, text: String, label: Double)
case class DataDocument(id: Long, text: String)

lazy val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Load the training data
val modelTrainingRecords = sc.cassandraTable("test", "ml_training")
  .select("id", "text", "label")
val labeledDocuments = modelTrainingRecords.map { record =>
  LabeledDocument(record.getLong("id"), record.getString("text"), record.getDouble("label"))
}.toDF
Machine Learning

// Create the pipeline
val pipeline = {
  val tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words")

  val hashingTF = new HashingTF()
    .setNumFeatures(1000)
    .setInputCol(tokenizer.getOutputCol)
    .setOutputCol("features")

  val lr = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(0.001)

  new Pipeline()
    .setStages(Array(tokenizer, hashingTF, lr))
}

// Fit the pipeline to training documents.
val model = pipeline.fit(labeledDocuments)
Machine Learning

// Load the data to run against the model
val modelTestRecords = sc.cassandraTable("test", "ml_text")
val dataDocuments = modelTestRecords.map { record =>
  DataDocument(record.getLong(0), record.getString(1))
}.toDF

// Score the documents and print the probability and predicted label for each
model.transform(dataDocuments)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }
Resources

Spark
• Books: http://shop.oreilly.com/product/0636920028512.do

Scala (knowing Scala will really help you progress in Spark)
• Functional Programming Principles in Scala (videos): https://www.youtube.com/user/afigfigueira/playlists?shelf_id=9&view=50&sort=dd
• Books: http://www.scala-lang.org/documentation/books.html

Spark and Cassandra
• DataStax Academy: http://academy.datastax.com/
  • Self-paced course: DS320: DataStax Enterprise Analytics with Apache Spark (really good!)
  • Tutorials
• Spark Cassandra Connector website (lots of good examples): https://github.com/datastax/spark-cassandra-connector