Zero to Hero Data Pipeline: from MongoDB to Cassandra


Demi Ben-Ari - VP R&D @ Panorays

About Me

Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. Computer Science – Academic College of Tel Aviv-Yaffo
● Co-Founder of the “Big Things” Big Data community

In the past:
● Sr. Data Engineer – Windward
● Team Leader & Sr. Java Software Engineer, Missile Defense and Alert System – “Ofek”, IAF

Interested in almost every kind of technology – a true geek

Agenda

● Data flow with MongoDB
● The problem we had @ Windward
● Solution
● Lessons learned from a newbie
● Conclusions

What Windward does...

Special in Windward’s Domain

Structure of the Data

● Geo Locations + Metadata

● Arriving over time

● Different types of messages being reported by satellites

● Encoded

● Might arrive later than actually transmitted

Data Flow Diagram

[Diagram: external data sources → data pipeline → parsed raw data → entity resolution process → analytics layers building insights on top of the entities (anomaly detection, trends) → data output layer → UI for end users]

Environment Description

[Diagram: clusters across the environments (Dev, Testing, Live/Staging, Production), with OB1K RESTful Java services]

Data Pipeline flow - Use case

● Batch Apache Spark applications running every 10-60 minutes
● Request rate:
   ○ Bursts of ~9 million requests per batch job
   ○ Beginning – Reads
      ■ Serving the data to other services and clients
   ○ End – Writes

MongoDB + Spark

[Diagram: Spark cluster (Master + Workers 1..N) writing to and reading from a MongoDB replica set]

Spark Slave - Server Specs

● Instance type: r3.xlarge
● CPUs: 4
● RAM: 30.5GB
● Storage: ephemeral
● Amount: 10+

MongoDB - Server Specs

● MongoDB version: 2.6.1
● Instance type: m3.xlarge (AWS)
● CPUs: 4
● RAM: 15GB
● Storage: EBS
● DB size: ~500GB
● Collection indexes: 5 (4 compound)

Situations & Problems

The Problem

● Batch jobs
   ○ Should run for 5-10 minutes in total
   ○ Actual: runs for ~40 minutes
● Why?
   ○ ~20 minutes to write with the Java MongoDB driver – async, Unacknowledged write concern (see the sketch below)
   ○ ~20 minutes to sync the journal
   ○ Total: ~40 minutes of the DB being unavailable
   ○ No batch process response and no UI serving
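To make that failure mode concrete, here is a minimal sketch of a fire-and-forget write, assuming the legacy 2.x MongoDB Java driver API of that era; the host, database, collection, and field names are hypothetical:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;

public class UnacknowledgedWriteSketch {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost");             // hypothetical host
        DB db = client.getDB("tracking");                              // hypothetical database
        DBCollection messages = db.getCollection("raw_messages");      // hypothetical collection

        // Fire-and-forget: the driver returns before the server confirms the write,
        // so the batch job "finishes" while MongoDB is still absorbing the load.
        messages.setWriteConcern(WriteConcern.UNACKNOWLEDGED);
        messages.insert(new BasicDBObject("vesselId", 42)
                .append("lat", 32.1)
                .append("lon", 34.8));

        client.close();
    }
}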

Alternative Solutions

● Sharded MongoDB (with replica sets)
   ○ Pros:
      ■ Increases throughput by the number of shards
      ■ Increases the availability of the DB
   ○ Cons:
      ■ Very hard to manage DevOps-wise (for a small team of developers)
      ■ High cost of servers – each shard needs 3 replicas

MongoDB + Spark

[Diagram: Spark cluster (Master + Workers 1..N) writing to and reading from a sharded MongoDB cluster, each shard backed by a replica set]

Our DevOps guy after that Solution

We had no DevOps guy at that time

Alternative Solutions

● DynamoDB (we’re hosted on Amazon)
   ○ Pros:
      ■ No need to manage DevOps
   ○ Cons:
      ■ A “Catholic wedding” with Amazon’s service – vendor lock-in
      ■ Not enough known use cases
      ■ Might reach a high cost for the service

Alternative Solutions

● Apache Cassandra
   ○ Pros:
      ■ Very large developer community
      ■ Linearly scalable database
      ■ No single-master architecture
      ■ Proven to work with distributed engines like Apache Spark
   ○ Cons:
      ■ We had no experience at all with the database
      ■ No geospatial index – we needed to implement it ourselves

The Solution

● Migration to Apache Cassandra (steps)
   ○ Writing to MongoDB and Cassandra simultaneously
   ○ Easily creating a Cassandra cluster using the DataStax Community AMI on AWS
● First easy step – using the spark-cassandra-connector
   ○ (Easy bootstrap move to Spark <=> Cassandra; see the sketch below)
● Creating a monitoring dashboard for Cassandra
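For that first step, a minimal read through the spark-cassandra-connector’s Java API might look like the sketch below; the contact point, keyspace, and table names are hypothetical:

import com.datastax.spark.connector.japi.CassandraJavaUtil;
import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ConnectorReadSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("connector-read-sketch")
                .set("spark.cassandra.connection.host", "127.0.0.1");   // hypothetical node
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each Spark partition reads directly from the Cassandra nodes that own its token range.
        JavaRDD<CassandraRow> rows = CassandraJavaUtil.javaFunctions(sc)
                .cassandraTable("tracking", "positions");               // hypothetical keyspace/table

        System.out.println("rows: " + rows.count());
        sc.stop();
    }
}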

Cassandra + Spark

[Diagram: Spark cluster (Workers 1..N) writing to and reading from a Cassandra cluster]

Cassandra + Serving

[Diagram: web services writing to and reading from the Cassandra cluster, serving the UI clients]

Result

● Performance improvement
   ○ The batch-write parts of the job run in 3 minutes instead of ~40 minutes with MongoDB
● It took 2 weeks to go from “Zero to Hero” and ramp up a running solution that works without glitches

Another Problem?

Transferring the Heaviest Process

● A microservice that runs every 10 minutes
● Writes 30GB to Cassandra per iteration
   ○ (Replication factor 3 => 90GB)
● At first it took us 18 minutes to do all of the writes
   ○ Not acceptable in a 10-minute process

Cluster on OpsCenter Before

Transferring the Heaviest Process

● Solutions
   ○ We chose the i2.xlarge instance type
   ○ Optimization of the cluster
   ○ Changing the JDK to Java 8
      ■ Changing the GC algorithm to G1
   ○ Tuning the operating system
      ■ ulimit, removing swap
   ○ Write time went down to ~5 minutes (for 30GB, RF=3)

Sounds good, right? I don’t think so

Cluster on OpsCenter After

The Solution

● Taking the same data model that we held in Cassandra (all of the raw data per 10 minutes) and putting it on S3
   ○ Write time went down from ~5 minutes to 1.5 minutes
● Added another process, not dependent on the main one, that runs every 15 minutes
   ○ Reads from S3, downscales the amount of data, and writes it to Cassandra for serving (see the sketch below)
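A rough sketch of that secondary process, assuming Spark’s Java API plus the spark-cassandra-connector; the S3 path, CSV format, sampling factor, keyspace, and table are all hypothetical, and the real downscaling is an aggregation rather than a plain sample:

import com.datastax.spark.connector.japi.CassandraJavaUtil;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.Serializable;

public class DownscaleFromS3Sketch {
    // Hypothetical bean matching a serving table (keyspace "tracking", table "serving_positions")
    public static class ServingRow implements Serializable {
        private int vesselId; private double lat; private double lon;
        public static ServingRow fromCsvLine(String line) {
            String[] parts = line.split(",");
            ServingRow row = new ServingRow();
            row.vesselId = Integer.parseInt(parts[0]);
            row.lat = Double.parseDouble(parts[1]);
            row.lon = Double.parseDouble(parts[2]);
            return row;
        }
        public int getVesselId() { return vesselId; }  public void setVesselId(int v) { vesselId = v; }
        public double getLat() { return lat; }         public void setLat(double v) { lat = v; }
        public double getLon() { return lon; }         public void setLon(double v) { lon = v; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("downscale-from-s3")
                .set("spark.cassandra.connection.host", "127.0.0.1");   // hypothetical node
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The heavy process already dumped the raw 10-minute batch to S3 (hypothetical path).
        JavaRDD<String> raw = sc.textFile("s3n://some-bucket/raw/latest/");

        // Downscale (illustrated here as a 10% sample) and write the smaller set to Cassandra.
        JavaRDD<ServingRow> downscaled = raw.sample(false, 0.1)
                .map(line -> ServingRow.fromCsvLine(line));

        CassandraJavaUtil.javaFunctions(downscaled)
                .writerBuilder("tracking", "serving_positions", CassandraJavaUtil.mapToRow(ServingRow.class))
                .saveToCassandra();

        sc.stop();
    }
}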

How does it look after all of that?

[Diagram: the heavy fusion process writes parsed raw and static/aggregated data; Spark analytics layers produce downscaled data that feeds UI serving]

Lessons Learned From a Newbie

Lessons Learned

● Use TokenAwarePolicy when connecting to the cluster – spreads the load across the coordinators

// DataStax Java driver: TokenAwarePolicy routes each request to a replica that owns
// the data, spreading the load across coordinators.
// (Contact points and socketOptions are configured elsewhere.)
Cluster cluster = Cluster.builder()
        .withSocketOptions(socketOptions)
        .withLoadBalancingPolicy(
                new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
        .build();

Lessons Learned

● Monitor everything!!! – all of the metrics

○ Cassandra

○ JVM

○ OS

● Feature-flag every parameter of the connection – you’ll need it for tuning later

Monitoring Cassandra

● Is that enough?… We can also connect it to Graphite (see the blog post “Monitoring the hell out of Cassandra”)

● Plug & play the metrics into Graphite – an internal Cassandra mechanism

● Back to the basics: dstat, iostat, iotop, jstack

Monitoring Cassandra

● Graphite + Grafana

CQL Driver

● Actions the drivers usually perform:
   ○ Open a connection
   ○ Apply actions to the database
      ■ Select, Insert, Update, Delete
   ○ Close the connection
● Do you monitor each of them?
   ○ Hint: Yes!!!! Hell yes!!!
● Creating a wrapper in any programming language and reporting the metrics (see the sketch below)
   ○ Count, execution times, errors…
   ○ Infrastructure code that will give great visibility
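A minimal sketch of such a wrapper, assuming the DataStax Java driver and the Dropwizard Metrics library; the class name and metric names are hypothetical:

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

// Hypothetical infrastructure code: every CQL request is counted, timed, and error-tracked.
public class MonitoredSession {
    private final Session session;
    private final MetricRegistry metrics;

    public MonitoredSession(Session session, MetricRegistry metrics) {
        this.session = session;
        this.metrics = metrics;
    }

    public ResultSet execute(Statement statement) {
        metrics.counter("cql.requests").inc();
        Timer.Context timed = metrics.timer("cql.latency").time();
        try {
            return session.execute(statement);
        } catch (RuntimeException e) {
            metrics.counter("cql.errors").inc();    // driver exceptions are RuntimeExceptions
            throw e;
        } finally {
            timed.stop();
        }
    }
}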

Lessons Learned

● CQL queries
   ○ Once we got to know our data model better
● It became more efficient, performance-wise, to use CQL statements instead of the “spark-cassandra-connector”
   ○ Prepared statements, delete queries (of full partitions), range queries… (sketched below)
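For illustration, a minimal prepared-statement and full-partition-delete sketch with the DataStax Java driver; the contact point, keyspace, table, and columns are hypothetical:

import java.util.Date;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class CqlStatementsSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();   // hypothetical node
        Session session = cluster.connect("tracking");                              // hypothetical keyspace

        // Prepared once, bound many times: parsing happens once and the bound values
        // carry the routing key, so TokenAwarePolicy can pick the right replica.
        PreparedStatement insert = session.prepare(
                "INSERT INTO positions (vessel_id, ts, lat, lon) VALUES (?, ?, ?, ?)");
        session.execute(insert.bind(42, new Date(), 32.1, 34.8));

        // Deleting a whole partition is a single partition-level tombstone.
        PreparedStatement deletePartition = session.prepare(
                "DELETE FROM positions WHERE vessel_id = ?");
        session.execute(deletePartition.bind(42));

        cluster.close();
    }
}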

Lessons Learned

● “nodetool” is your friend
   ○ tpstats, cfhistograms, cfstats…
● Data modeling
   ○ Time series data
   ○ Evenly distributed partitions
   ○ Everything becomes more rigid
● Know your queries before you model (see the schema sketch below)
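As one illustration of time-series modeling with evenly distributed partitions, a hypothetical schema that buckets each entity’s data by day, so no single partition grows unbounded (all names are made up):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TimeSeriesSchemaSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();   // hypothetical node
        Session session = cluster.connect("tracking");                              // hypothetical keyspace

        // Partition key = (vessel_id, day bucket): spreads the load evenly and keeps each
        // partition bounded; clustering by ts (newest first) serves time-range queries.
        session.execute(
                "CREATE TABLE IF NOT EXISTS positions_by_day ("
              + "  vessel_id int, day text, ts timestamp, lat double, lon double,"
              + "  PRIMARY KEY ((vessel_id, day), ts)"
              + ") WITH CLUSTERING ORDER BY (ts DESC)");

        cluster.close();
    }
}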

Lessons Learned

● DevCenter – by DataStax, free
● DBeaver – free & open source
   ○ Supports a wide variety of databases
● Redash – http://redash.io/
   ○ Open source: https://github.com/getredash/redash

Conclusions

● Cassandra is a great, linearly scaling distributed database
● Monitor as much as you can
   ○ Get visibility into what’s going on in the cluster
● Modeling your data correctly is the key to success
● Be ready for your next war
   ○ Cassandra performance tuning – you’ll get to that for sure

Questions?

Demi Ben-Ari
