
Page 1: Spark cassandra connector.API, Best Practices and Use-Cases

Spark/Cassandra Connector: API, Best Practices & Use-Cases
DuyHai DOAN, Technical Advocate (@doanduyhai)

Page 2: Spark cassandra connector.API, Best Practices and Use-Cases

Who am I?

DuyHai DOAN, Cassandra technical advocate
•  talks, meetups, confs
•  open-source devs (Achilles, …)
•  OSS Cassandra point of contact

[email protected] ☞ @doanduyhai

Page 3: Spark cassandra connector.API, Best Practices and Use-Cases

Datastax

•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 200+ employees
•  Headquarters in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  Datastax Enterprise = OSS Cassandra + extra features

Page 4: Spark cassandra connector.API, Best Practices and Use-Cases

Agenda

•  Connector API by example
•  Best practices
•  Use-cases
•  The “Big Data Platform”

Page 5: Spark cassandra connector.API, Best Practices and Use-Cases

Spark/Cassandra connector API

Spark Core
Spark SQL
Spark Streaming

Page 6: Spark cassandra connector.API, Best Practices and Use-Cases

Connector architecture

•  All Cassandra types supported and converted to Scala types
•  Server-side data filtering (SELECT … WHERE …)
•  Uses the Java driver underneath
•  Scala and Java support

Page 7: Spark cassandra connector.API, Best Practices and Use-Cases

Connector architecture – Core API

Cassandra tables exposed as Spark RDDs

Read from and write to Cassandra

Mapping of Cassandra tables and rows to Scala objects
•  CassandraRow
•  Scala case class (object mapper)
•  Scala tuples
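A minimal sketch of the three mapping styles, assuming a hypothetical music.performers(name text PRIMARY KEY, country text) table (not taken from the demo repo):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

case class Performer(name: String, country: String) // hypothetical model

val conf = new SparkConf(true)
  .setAppName("core-api-sketch")
  .setMaster("local[2]")                               // assumed local run
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
val sc = new SparkContext(conf)

// 1. Generic CassandraRow access
val names = sc.cassandraTable("music", "performers")
  .map(row => row.getString("name"))

// 2. Object mapper: rows bound to a Scala case class
val performers = sc.cassandraTable[Performer]("music", "performers")

// 3. Scala tuples, selecting only the needed columns
val pairs = sc.cassandraTable[(String, String)]("music", "performers")
  .select("name", "country")

// Writing any of these RDDs back to Cassandra
performers.saveToCassandra("music", "performers_copy")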


Page 8: Spark cassandra connector.API, Best Practices and Use-Cases

Spark Core

https://github.com/doanduyhai/Cassandra-Spark-Demo

Page 9: Spark cassandra connector.API, Best Practices and Use-Cases

Connector architecture – Spark SQL

Mapping of Cassandra table to SchemaRDD
•  CassandraSQLRow → SparkRow
•  custom query plan
•  push predicates to CQL for early filtering

SELECT * FROM user_emails WHERE login = 'jdoe';
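A hedged sketch of running that query through the connector's SQL integration as it looked in the 1.x era (CassandraSQLContext); the keyspace name is an assumption, and `sc` is an existing SparkContext:

import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)
cc.setKeyspace("ks") // assumed keyspace holding user_emails

// The predicate on login is pushed down to CQL for early filtering
val emails = cc.sql("SELECT * FROM user_emails WHERE login = 'jdoe'")
emails.collect().foreach(println)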


Page 10: Spark cassandra connector.API, Best Practices and Use-Cases

Spark SQL

https://github.com/doanduyhai/Cassandra-Spark-Demo

Page 11: Spark cassandra connector.API, Best Practices and Use-Cases

Connector architecture – Spark Streaming

Streaming data INTO Cassandra table
•  trivial setup
•  be careful about your Cassandra data model when having an infinite stream!

Streaming data OUT of Cassandra tables (CDC)?
•  work in progress …
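A minimal sketch of the "trivial setup" for streaming INTO Cassandra, assuming a hypothetical ks.words(word text PRIMARY KEY, count int) table, a socket source, and an existing SparkContext `sc`:

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))

ssc.socketTextStream("localhost", 9999)  // any DStream source works the same way
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveToCassandra("ks", "words", SomeColumns("word", "count"))

ssc.start()
ssc.awaitTermination()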


Page 12: Spark cassandra connector.API, Best Practices and Use-Cases

Spark Streaming

https://github.com/doanduyhai/Cassandra-Spark-Demo

Page 13: Spark cassandra connector.API, Best Practices and Use-Cases

Q & A

Page 14: Spark cassandra connector.API, Best Practices and Use-Cases

Spark/Cassandra best practices

Data locality
Failure handling
Cross-region operations

Page 15: Spark cassandra connector.API, Best Practices and Use-Cases

Cluster deployment

Stand-alone cluster

[Diagram: a Spark Master and one Spark Worker co-located with each Cassandra node]

Page 16: Spark cassandra connector.API, Best Practices and Use-Cases

Remember token ranges?

A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]

[Diagram: ring of 8 nodes n1 … n8, each owning one token range A … H]

Page 17: Spark cassandra connector.API, Best Practices and Use-Cases

Data Locality

[Diagram: co-located cluster; each Spark RDD partition maps to the Cassandra token ranges owned by its local node]

Page 18: Spark cassandra connector.API, Best Practices and Use-Cases

Data Locality

Use Murmur3Partitioner

[Diagram: co-located cluster]

Page 19: Spark cassandra connector.API, Best Practices and Use-Cases

Read data locality

Read from Cassandra

Page 20: Spark cassandra connector.API, Best Practices and Use-Cases

Read data locality

Spark shuffle operations

Page 21: Spark cassandra connector.API, Best Practices and Use-Cases

Write to Cassandra without data locality

Because of shuffle, the original data locality is lost

Async batches fan out writes to Cassandra

Page 22: Spark cassandra connector.API, Best Practices and Use-Cases

Or repartition before write

Write to Cassandra:

rdd.repartitionByCassandraReplica("keyspace", "table")

Page 23: Spark cassandra connector.API, Best Practices and Use-Cases

Write data locality

•  either stream data in the Spark layer using repartitionByCassandraReplica()
•  or flush data to Cassandra by async batches
•  in any case, there will be data movement on the network (sorry, no magic)

Both options are sketched below.
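A hedged sketch of the two write paths, assuming a hypothetical ks.albums_by_artist(artist text, title text, year int, PRIMARY KEY(artist, title)) table and an existing SparkContext `sc`:

import com.datastax.spark.connector._

val processed = sc.parallelize(Seq(("The Beatles", "Abbey Road", 1969)))

// Option 1: plain save. The connector groups rows into async batches and
// fans them out to whichever replicas own each partition key.
processed.saveToCassandra("ks", "albums_by_artist",
  SomeColumns("artist", "title", "year"))

// Option 2: repartition by replica first, so each Spark partition is
// written from a node that is a replica for its rows.
processed
  .repartitionByCassandraReplica("ks", "albums_by_artist")
  .saveToCassandra("ks", "albums_by_artist",
    SomeColumns("artist", "title", "year"))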

Page 24: Spark cassandra connector.API, Best Practices and Use-Cases

Joins with data locality

CREATE TABLE artists(name text, style text, …, PRIMARY KEY(name));

CREATE TABLE albums(title text, artist text, year int, …, PRIMARY KEY(title));

val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist", "year")
    .as((_: String, _: Int))
    // Repartition RDDs by the "artists" PK, which is "name"
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    // Join with the "artists" table, selecting only "name" and "country" columns
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))

Page 25: Spark cassandra connector.API, Best Practices and Use-Cases

Joins pipeline with data locality

LOCAL READ FROM CASSANDRA

Page 26: Spark cassandra connector.API, Best Practices and Use-Cases

Joins pipeline with data locality

SHUFFLE DATA WITH SPARK

Page 27: Spark cassandra connector.API, Best Practices and Use-Cases

Joins pipeline with data locality

REPARTITION TO MAP CASSANDRA REPLICA

Page 28: Spark cassandra connector.API, Best Practices and Use-Cases

Joins pipeline with data locality

JOIN WITH DATA LOCALITY

Page 29: Spark cassandra connector.API, Best Practices and Use-Cases

Joins pipeline with data locality

ANOTHER ROUND OF SHUFFLING

Page 30: Spark cassandra connector.API, Best Practices and Use-Cases

Joins pipeline with data locality

REPARTITION AGAIN FOR CASSANDRA

Page 31: Spark cassandra connector.API, Best Practices and Use-Cases

Joins pipeline with data locality

SAVE TO CASSANDRA WITH LOCALITY

Page 32: Spark cassandra connector.API, Best Practices and Use-Cases

Perfect data locality scenario

•  read locally from Cassandra
•  use operations that do not require a shuffle in Spark (map, filter, …)
•  repartitionByCassandraReplica() to a table having the same partition key as the original table
•  save back into this Cassandra table

USE CASE: sanitize, validate, normalize, transform data (sketched below)
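A minimal end-to-end sketch of this scenario, assuming a hypothetical ks.users(login text PRIMARY KEY, email text) table and an existing SparkContext `sc`:

import com.datastax.spark.connector._

sc.cassandraTable[(String, String)]("ks", "users")   // local read
  .select("login", "email")
  .map { case (login, email) =>                      // no shuffle
    (login, email.trim.toLowerCase)                  // sanitize in place
  }
  .repartitionByCassandraReplica("ks", "users")      // same partition key
  .saveToCassandra("ks", "users",                    // local write
    SomeColumns("login", "email"))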

Page 33: Spark cassandra connector.API, Best Practices and Use-Cases

Failure handling

Stand-alone cluster

[Diagram: co-located Spark/Cassandra cluster]

Page 34: Spark cassandra connector.API, Best Practices and Use-Cases

Failure handling

What if 1 node is down? What if 1 node is overloaded?

[Diagram: co-located cluster with one node unavailable]

Page 35: Spark cassandra connector.API, Best Practices and Use-Cases

Failure handling

What if 1 node is down? What if 1 node is overloaded?
☞ The Spark master will re-assign the job to another worker

[Diagram: co-located cluster; the task is re-assigned to another worker]

Page 36: Spark cassandra connector.API, Best Practices and Use-Cases

Failure handling

Oh no, my data locality!!!

Page 37: Spark cassandra connector.API, Best Practices and Use-Cases

Failure handling

Page 38: Spark cassandra connector.API, Best Practices and Use-Cases

Data Locality Impl

Remember the RDD interface?

abstract class RDD[T](…) {
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

  protected def getPartitions: Array[Partition]

  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}

Page 39: Spark cassandra connector.API, Best Practices and Use-Cases

Data Locality Impl

getPreferredLocations(split: Partition) returns the IP addresses of the Cassandra replicas that own the data behind this Spark partition.
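A hypothetical, simplified illustration (not the connector's actual code) of how a Cassandra-aware RDD can advertise replica addresses to the Spark scheduler through getPreferredLocations:

import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, SparkContext, TaskContext}

// Hypothetical partition type carrying the replica addresses that own
// the token ranges behind this Spark partition.
case class TokenRangePartition(index: Int, replicas: Seq[String]) extends Partition

class ReplicaAwareRDD(sc: SparkContext, parts: Seq[TokenRangePartition])
  extends RDD[String](sc, Nil) {

  // Real code would read rows from the local Cassandra replica here
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator.empty

  override protected def getPartitions: Array[Partition] = parts.toArray

  // The scheduler reads this to place each task on a replica node
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    split.asInstanceOf[TokenRangePartition].replicas
}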

Page 40: Spark cassandra connector.API, Best Practices and Use-Cases

Failure handling

If RF > 1, the Spark master chooses the next preferred location, which is a replica 😎

Tune these parameters (a sketch with illustrative values follows):
①  spark.locality.wait
②  spark.locality.wait.process
③  spark.locality.wait.node
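A minimal sketch of setting those knobs; the 3000 ms values are illustrative assumptions, not recommendations:

import org.apache.spark.SparkConf

// How long the scheduler waits for a data-local executor slot before
// falling back to a less local one (values in milliseconds).
val conf = new SparkConf(true)
  .set("spark.locality.wait", "3000")          // default for all levels
  .set("spark.locality.wait.process", "3000")  // PROCESS_LOCAL before NODE_LOCAL
  .set("spark.locality.wait.node", "3000")     // NODE_LOCAL before RACK_LOCAL/ANY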

[Diagram: co-located cluster; the re-assigned task lands on a node holding a replica of the data]

Page 41: Spark cassandra connector.API, Best Practices and Use-Cases

Cross-DC operations

val confDC1 = new SparkConf(true)
  .setAppName("data_migration")
  .setMaster("master_ip")
  .set("spark.cassandra.connection.host", "DC_1_hostnames")
  .set("spark.cassandra.connection.local_dc", "DC_1")

val confDC2 = new SparkConf(true)
  .setAppName("data_migration")
  .setMaster("master_ip")
  .set("spark.cassandra.connection.host", "DC_2_hostnames")
  .set("spark.cassandra.connection.local_dc", "DC_2")

val sc = new SparkContext(confDC1)

sc.cassandraTable[Performer](KEYSPACE, PERFORMERS)
  .map[Performer](???)
  .saveToCassandra(KEYSPACE, PERFORMERS)
    (CassandraConnector(confDC2), implicitly[RowWriterFactory[Performer]])

Page 42: Spark cassandra connector.API, Best Practices and Use-Cases

Cross-cluster operations

val confCluster1 = new SparkConf(true)
  .setAppName("data_migration")
  .setMaster("master_ip")
  .set("spark.cassandra.connection.host", "cluster_1_hostnames")

val confCluster2 = new SparkConf(true)
  .setAppName("data_migration")
  .setMaster("master_ip")
  .set("spark.cassandra.connection.host", "cluster_2_hostnames")

val sc = new SparkContext(confCluster1)

sc.cassandraTable[Performer](KEYSPACE, PERFORMERS)
  .map[Performer](???)
  .saveToCassandra(KEYSPACE, PERFORMERS)
    (CassandraConnector(confCluster2), implicitly[RowWriterFactory[Performer]])

Page 43: Spark cassandra connector.API, Best Practices and Use-Cases

Q & A

Page 44: Spark cassandra connector.API, Best Practices and Use-Cases

Spark/Cassandra use-cases

Data cleaning
Schema migration
Analytics

Page 45: Spark cassandra connector.API, Best Practices and Use-Cases

Use Cases

Load data from various sources

Analytics (join, aggregate, transform, …)

Sanitize, validate, normalize, transform data

Schema migration, Data conversion

Page 46: Spark cassandra connector.API, Best Practices and Use-Cases

Data cleaning use-cases

Bug in your application? Dirty input data?
☞ Spark job to clean it up! (perfect data locality)

Sanitize, validate, normalize, transform data

Page 47: Spark cassandra connector.API, Best Practices and Use-Cases

Data Cleaning

https://github.com/doanduyhai/Cassandra-Spark-Demo

Page 48: Spark cassandra connector.API, Best Practices and Use-Cases

Schema migration use-cases

Business requirements change with time? Current data model no longer relevant?
☞ Spark job to migrate data! (a sketch follows)

Schema migration, Data conversion
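A minimal migration sketch, assuming a hypothetical music.albums(title text PRIMARY KEY, artist text, year int) source table, a music.albums_by_artist(artist text, title text, year int, PRIMARY KEY(artist, title)) target table, and an existing SparkContext `sc`:

import com.datastax.spark.connector._

// Read the old layout, write the same columns into the new layout;
// the connector maps tuple positions onto the listed columns.
sc.cassandraTable[(String, String, Int)]("music", "albums")
  .select("title", "artist", "year")
  .saveToCassandra("music", "albums_by_artist",
    SomeColumns("title", "artist", "year"))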

Page 49: Spark cassandra connector.API, Best Practices and Use-Cases

Data Migration

https://github.com/doanduyhai/Cassandra-Spark-Demo

Page 50: Spark cassandra connector.API, Best Practices and Use-Cases

Analytics use-cases

Given existing tables of performers and albums, I want:

①  the top 10 most common music styles (pop, rock, RnB, …)
②  performer productivity (album count) by origin country and by decade

☞ Spark job to compute analytics! (a sketch of query ① follows)

Analytics (join, aggregate, transform, …)
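A hedged sketch of query ①, assuming the artists(name text PRIMARY KEY, style text, …) table from the join example and an existing SparkContext `sc`:

import com.datastax.spark.connector._

val top10Styles = sc.cassandraTable[(String, String)]("music", "artists")
  .select("name", "style")
  .map { case (_, style) => (style, 1) }
  .reduceByKey(_ + _)                  // count performers per style
  .top(10)(Ordering.by(_._2))          // 10 most common styles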

Page 51: Spark cassandra connector.API, Best Practices and Use-Cases

Analytics pipeline

①  Read from production transactional tables
②  Perform aggregation with Spark
③  Save back data into dedicated tables for fast visualization
④  Repeat step ①

Page 52: Spark cassandra connector.API, Best Practices and Use-Cases

Analytics

https://github.com/doanduyhai/Cassandra-Spark-Demo

Page 53: Spark cassandra connector.API, Best Practices and Use-Cases

Q & A

Page 54: Spark cassandra connector.API, Best Practices and Use-Cases

The “Big Data Platform”

Page 55: Spark cassandra connector.API, Best Practices and Use-Cases

Our vision

We had a dream … to provide a Big Data Platform built for the performance & high availability demands of IoT, web & mobile applications.

Page 56: Spark cassandra connector.API, Best Practices and Use-Cases

Until now in Datastax Enterprise

Cassandra + Solr in the same JVM
Unlock full text search power for Cassandra
CQL syntax extension:

SELECT * FROM users WHERE solr_query = 'age:[33 TO *] AND gender:male';

SELECT * FROM users WHERE solr_query = 'lastname:*schwei?er';

Page 57: Spark cassandra connector.API, Best Practices and Use-Cases

Cassandra + Solr

Page 58: Spark cassandra connector.API, Best Practices and Use-Cases

Now with Spark

Cassandra + Spark
Unlock full analytics power for Cassandra
Spark/Cassandra connector

Page 59: Spark cassandra connector.API, Best Practices and Use-Cases

Tomorrow (DSE 4.7)

Cassandra + Spark + Solr
Unlock full text search + analytics power for Cassandra

Page 60: Spark cassandra connector.API, Best Practices and Use-Cases

Tomorrow (DSE 4.7)

Cassandra + Spark + Solr
Unlock full text search + analytics power for Cassandra

Page 61: Spark cassandra connector.API, Best Practices and Use-Cases

The idea

①  Filter the maximum with Cassandra and Solr
②  Fetch only a small data set in memory
③  Aggregate with Spark

☞ near real-time interactive analytics queries are possible with restrictive criteria

Page 62: Spark cassandra connector.API, Best Practices and Use-Cases

Datastax Enterprise 4.7

With a 3rd component for full text search, how do we preserve data locality?

Page 63: Spark cassandra connector.API, Best Practices and Use-Cases

Stand-alone search cluster caveat

The bad way v1: perform the search from the Spark “driver program”

Page 64: Spark cassandra connector.API, Best Practices and Use-Cases

Stand-alone search cluster caveat

The bad way v2: search from Spark workers with restricted routing

Page 65: Spark cassandra connector.API, Best Practices and Use-Cases

Stand-alone search cluster caveat

The bad way v3: search from Cassandra nodes with a connector to Solr

Page 66: Spark cassandra connector.API, Best Practices and Use-Cases

Stand-alone search cluster caveat

The ops team won’t be your friend:
•  3 clusters to manage: Spark, Cassandra & Solr/whatever
•  lots of moving parts

Impacts on Spark jobs:
•  increased response time due to latency
•  the 99.9th percentile can be very slow

Page 67: Spark cassandra connector.API, Best Practices and Use-Cases

Datastax Enterprise 4.7

The right way: distributed “local search”

Page 68: Spark cassandra connector.API, Best Practices and Use-Cases

Datastax Enterprise 4.7

①  compute Spark partitions using Cassandra token ranges
②  on each partition, use Solr for local data filtering (no distributed query!)
③  fetch data back into Spark for aggregations

val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist", "year")
    .where("solr_query = 'style:*rock* AND ratings:[3 TO *]'")
    .as((_: String, _: Int))
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))
    .where("solr_query = 'age:[20 TO 30]'")

Page 69: Spark cassandra connector.API, Best Practices and Use-Cases

Datastax Enterprise 4.7

SELECT … FROM …
WHERE token(#partition) > 3X/8
  AND token(#partition) <= 4X/8
  AND solr_query = 'full text search expression';

Token range: ]3X/8, 4X/8]

Advantages of same-JVM Cassandra + Solr integration:
①  single-pass local full text search (no fan out)
②  data retrieval

Page 70: Spark cassandra connector.API, Best Practices and Use-Cases

Datastax Enterprise 4.7

Scalable solution: ×2 volume → ×2 nodes → constant processing time

Page 71: Spark cassandra connector.API, Best Practices and Use-Cases

Cassandra + Spark + Solr

https://github.com/doanduyhai/Cassandra-Spark-Demo

Page 72: Spark cassandra connector.API, Best Practices and Use-Cases

Q & A

Page 73: Spark cassandra connector.API, Best Practices and Use-Cases

Thank You @doanduyhai

[email protected]

https://academy.datastax.com/