Cassandra + Spark + ELK

Dmitriy Kalyada @ 2015

What is Spark?

• Master: Driver program

• Workers: Executors

• High Availability

• Standby Masters with ZooKeeper

• Single-Node Recovery with Local File System
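The two recovery modes above are switched on through properties passed to the standalone master. The helper below is a minimal sketch that assembles the `-D` options one would typically place in `SPARK_DAEMON_JAVA_OPTS`; the function name, the default ZooKeeper directory, and the example hosts are assumptions made for illustration, while the `spark.deploy.*` property names come from Spark's standalone-mode documentation.

```python
def recovery_opts(mode, zk_url=None, zk_dir="/spark", recovery_dir=None):
    """Build -D options for the Spark standalone master (hypothetical helper)."""
    if mode == "ZOOKEEPER":
        # Standby masters: leader election and state are kept in ZooKeeper.
        if not zk_url:
            raise ValueError("zk_url is required for ZOOKEEPER mode")
        props = {
            "spark.deploy.recoveryMode": "ZOOKEEPER",
            "spark.deploy.zookeeper.url": zk_url,
            "spark.deploy.zookeeper.dir": zk_dir,
        }
    elif mode == "FILESYSTEM":
        # Single-node recovery: state is written to a local directory.
        if not recovery_dir:
            raise ValueError("recovery_dir is required for FILESYSTEM mode")
        props = {
            "spark.deploy.recoveryMode": "FILESYSTEM",
            "spark.deploy.recoveryDirectory": recovery_dir,
        }
    else:
        raise ValueError("unknown recovery mode: %s" % mode)
    return " ".join("-D%s=%s" % (key, value) for key, value in props.items())


if __name__ == "__main__":
    # Example hosts and path are placeholders, not values from the talk.
    print(recovery_opts("ZOOKEEPER", zk_url="zk1:2181,zk2:2181"))
    print(recovery_opts("FILESYSTEM", recovery_dir="/var/spark/recovery"))
```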

Under the hood

• Resilient Distributed Dataset (RDD)

• Scala + Akka Framework

• Java, Scala, Python API

• Spark SQL, MLlib, Spark Streaming, GraphX

Our particular case

Devices

Cassandra

Data flow

[Diagram: Input Source(s) → Fetcher → x-RDD → Transformer → x-RDD → Saver → Output Source]
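The Fetcher, Transformer, and Saver stages can be pictured as three functions chained together. The sketch below uses plain Python lists as stand-ins for RDDs so that it runs without a cluster; every name in it (`run_job`, `fetch_readings`, `drop_empty`, `make_saver`, the record fields) is hypothetical and not taken from the original job.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def run_job(fetch: Callable[[], List[Record]],
            transform: Callable[[List[Record]], List[Record]],
            save: Callable[[List[Record]], int]) -> int:
    """Chain the three stages; returns the number of records saved."""
    raw = fetch()            # Fetcher: input source -> records
    shaped = transform(raw)  # Transformer: records -> records
    return save(shaped)      # Saver: records -> output source


def fetch_readings() -> List[Record]:
    # Stand-in for reading device data from an input source.
    return [
        {"device_id": "d1", "value": 21.5},
        {"device_id": "d2", "value": None},
        {"device_id": "d1", "value": 22.0},
    ]


def drop_empty(records: List[Record]) -> List[Record]:
    # Keep only records that carry a value and tag their origin.
    return [dict(r, source="input") for r in records if r["value"] is not None]


def make_saver(sink: List[Record]) -> Callable[[List[Record]], int]:
    # Stand-in for writing to an output source; appends to a list.
    def save(records: List[Record]) -> int:
        sink.extend(records)
        return len(records)
    return save


if __name__ == "__main__":
    output: List[Record] = []
    saved = run_job(fetch_readings, drop_empty, make_saver(output))
    print(saved, output)
```

Keeping each stage behind a plain function signature means an input or output source can be swapped without touching the other two stages.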

Spark Cassandra Connector

• Represents Cassandra tables as Spark RDDs

• Write Spark RDDs to Cassandra tables

• Execute CQL queries in Spark applications

https://github.com/datastax/spark-cassandra-connector

CassandraRDD settings

• Connection params

• Fetching params

1. input.split.size: approximate number of Cassandra (C*) partitions per Spark partition.

2. input.page.row.size: number of CQL rows fetched per round trip.
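To get a feel for how the two fetching parameters interact, the arithmetic can be sketched as follows. This is a rough back-of-the-envelope model, not the connector's actual planning logic: it assumes rows are spread evenly across Cassandra partitions, and the default values (100000 and 1000) are assumptions about connector releases of that period. The function name and the returned keys are made up for illustration.

```python
import math


def estimate_read_plan(cassandra_partitions, rows_per_partition,
                       split_size=100000, page_row_size=1000):
    """Rough estimate of Spark partitions and fetch round trips."""
    # input.split.size: Cassandra partitions grouped into one Spark partition.
    spark_partitions = max(1, math.ceil(cassandra_partitions / split_size))
    total_rows = cassandra_partitions * rows_per_partition
    rows_per_spark_partition = math.ceil(total_rows / spark_partitions)
    # input.page.row.size: CQL rows returned per round trip.
    round_trips = math.ceil(rows_per_spark_partition / page_row_size)
    return {
        "spark_partitions": spark_partitions,
        "rows_per_spark_partition": rows_per_spark_partition,
        "round_trips_per_spark_partition": round_trips,
    }


if __name__ == "__main__":
    print(estimate_read_plan(cassandra_partitions=1000000, rows_per_partition=10))
```

A smaller split size yields more, lighter Spark partitions; a larger page size reduces round trips at the cost of more memory per fetch.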

Fetching essentials

[Figure: token ranges retrieved from Cassandra, shown as (start, end) pairs, e.g. (-968391295277638458, -893783532241185833), (-7378580094811526501, -7340240117176401239), (6426215139012569257, 6428979455828914106)]

• Supports Murmur3Partitioner and RandomPartitioner

• Retrieves token ranges from Cassandra

• Prediction based on 16 random token ranges
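The idea of predicting data volume from a handful of token ranges and then grouping ranges into splits can be sketched in plain Python. This is an illustration under stated assumptions, not the connector's implementation: it assumes the Murmur3 ring of 2^64 tokens, a caller-supplied `count_rows_in_range` callback (hypothetical), and a simple greedy grouping.

```python
import random

RING_SIZE = 2 ** 64  # number of tokens on the Murmur3Partitioner ring


def range_width(start, end):
    """Tokens covered by (start, end], allowing wrap-around on the ring."""
    width = (end - start) % RING_SIZE
    return width or RING_SIZE  # start == end is treated as the full ring


def estimate_total_rows(token_ranges, count_rows_in_range,
                        sample_size=16, rng=None):
    """Extrapolate the total row count from a random sample of token ranges."""
    rng = rng or random.Random()
    ranges = list(token_ranges)
    sample = rng.sample(ranges, min(sample_size, len(ranges)))
    sampled_rows = sum(count_rows_in_range(s, e) for s, e in sample)
    sampled_width = sum(range_width(s, e) for s, e in sample)
    total_width = sum(range_width(s, e) for s, e in ranges)
    # Rows per token in the sample, scaled to the whole set of ranges.
    return sampled_rows / sampled_width * total_width


def group_ranges(token_ranges, rows_per_token, split_size):
    """Greedily pack consecutive token ranges into splits of ~split_size rows."""
    groups, current, current_rows = [], [], 0.0
    for start, end in token_ranges:
        current.append((start, end))
        current_rows += range_width(start, end) * rows_per_token
        if current_rows >= split_size:
            groups.append(current)
            current, current_rows = [], 0.0
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    ranges = [(0, 100), (100, 200), (200, 300), (300, 400)]
    # Toy counter: pretend every token holds exactly two rows.
    total = estimate_total_rows(ranges, lambda s, e: (e - s) * 2)
    print(total, group_ranges(ranges, rows_per_token=2.0, split_size=400))
```

The estimate is only as good as the sample: if rows are unevenly spread across token ranges, the predicted split sizes can be far from the real ones.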

Data to RDD

Tokens Per RDD [input.split.size]

Token Range #N

Slurp amount [input.page.row.size]

Token range vs. number of rows

What to do?

• Change read strategy

• Split data into smaller pieces

• Increase cluster capacity

• Reorganize Cassandra schema

Elasticsearch

Elasticsearch & Kibana

• Index initialization: TransportClient

• Create/Delete Index

• Set up mappings

• Indexing: ScalaEsRDD

• Data presentation: Kibana
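As an illustration of the "create index" and "set up mappings" steps, the sketch below builds the JSON body such a request might carry. The exact mapping syntax depends on the Elasticsearch version; the typed layout used here follows the 1.x-era releases current in 2015, which is an assumption. The document type, field names, and shard counts are all hypothetical.

```python
import json


def build_index_body(doc_type, fields, shards=1, replicas=0):
    """Assemble index settings plus a typed mapping (1.x-style layout)."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            doc_type: {
                "properties": {
                    name: {"type": field_type}
                    for name, field_type in fields.items()
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical device-reading document; not taken from the talk.
    body = build_index_body(
        "reading",
        {"device_id": "string", "timestamp": "date", "value": "double"},
    )
    print(json.dumps(body, indent=2))
```

Defining the mapping before indexing keeps date and numeric fields from being guessed as strings, which matters for Kibana's time-based views.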

Kibana

Deployment

• Build package: Spark Job + Dependent Jars + Configs

• Upload to the Spark Master Node

• Start job submit script
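A job submit script usually boils down to one `spark-submit` invocation. The sketch below assembles such a command line from its parts; `--master`, `--class`, `--jars`, `--files`, and `--conf` are standard `spark-submit` options, while the class name, file paths, and host names are placeholders rather than values from the original deployment.

```python
import shlex


def build_submit_command(master_url, main_class, app_jar,
                         dependency_jars=(), config_files=(), conf=None):
    """Return the spark-submit argument list for a packaged job."""
    cmd = ["spark-submit", "--master", master_url, "--class", main_class]
    if dependency_jars:
        # Dependent jars are shipped to the executors as a comma-separated list.
        cmd += ["--jars", ",".join(dependency_jars)]
    if config_files:
        # Config files are placed in each executor's working directory.
        cmd += ["--files", ",".join(config_files)]
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", "%s=%s" % (key, value)]
    cmd.append(app_jar)  # the application jar goes last
    return cmd


if __name__ == "__main__":
    command = build_submit_command(
        master_url="spark://master-host:7077",       # placeholder host
        main_class="com.example.DeviceJob",          # hypothetical class
        app_jar="device-job.jar",
        dependency_jars=["lib/connector.jar", "lib/es-spark.jar"],
        config_files=["conf/job.properties"],
        conf={"spark.cassandra.connection.host": "cassandra-host"},
    )
    print(" ".join(shlex.quote(part) for part in command))
```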

Thank you

dkaliada@exadel.com
