Zero to Hero Data Pipeline: from MongoDB to Cassandra


Demi Ben-Ari - VP R&D @ Panorays

About Me

Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. Computer Science – Academic College of Tel Aviv-Yaffo
● Co-Founder of the “Big Things” Big Data community

In the past:
● Sr. Data Engineer – Windward
● Team Leader & Sr. Java Software Engineer, Missile Defense and Alert System – “Ofek”, IAF

Interested in almost every kind of technology – a true geek

Agenda

● Data flow with MongoDB
● The problem we had @ Windward
● Solution
● Lessons learned from a newbie
● Conclusions

What Windward does...

Special in Windward’s Domain

Structure of the Data

● Geo Locations + Metadata

● Arriving over time

● Different types of messages being reported by satellites

● Encoded

● Might arrive later than actually transmitted

Data Flow Diagram

[Diagram: external data sources → data pipeline → parsed raw data → entity resolution process → analytics layers building insights on top of the entities (anomaly detection, trends) → data output layer → UI for end users]

Environment Description

[Diagram: clusters across the environments (Dev, Testing, Live/Staging, Production), with OB1K RESTful Java services]

Data Pipeline flow - Use case

● Batch Apache Spark applications running every 10-60 minutes
● Request rate:
   ○ Bursts of ~9 million requests per batch job
   ○ Beginning – Reads
      ■ Serving the data to other services and clients
   ○ End – Writes

MongoDB + Spark

[Diagram: Spark cluster (Master + Workers 1..N) writing to and reading from a MongoDB replica set]

Spark Slave - Server Specs

● Instance type: r3.xlarge
● CPUs: 4
● RAM: 30.5GB
● Storage: ephemeral
● Amount: 10+

MongoDB - Server Specs

● MongoDB version: 2.6.1
● Instance type: m3.xlarge (AWS)
● CPUs: 4
● RAM: 15GB
● Storage: EBS
● DB size: ~500GB
● Collection indexes: 5 (4 compound)

Situations & Problems

The Problem

● Batch jobs
   ○ Should run for 5-10 minutes in total
   ○ Actual: runs for ~40 minutes
● Why?
   ○ ~20 minutes to write with the Java MongoDB driver – async, Unacknowledged write concern (see the sketch below)
   ○ ~20 minutes to sync the journal
   ○ Total: ~40 minutes of the DB being unavailable
   ○ No batch process response and no UI serving
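To make that failure mode concrete, here is a minimal sketch of a fire-and-forget write, assuming the legacy 2.x MongoDB Java driver API of that era; the host, database, collection, and field names are hypothetical:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;

public class UnacknowledgedWriteSketch {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost");             // hypothetical host
        DB db = client.getDB("tracking");                              // hypothetical database
        DBCollection messages = db.getCollection("raw_messages");      // hypothetical collection

        // Fire-and-forget: the driver returns before the server confirms the write,
        // so the batch job "finishes" while MongoDB is still absorbing the load.
        messages.setWriteConcern(WriteConcern.UNACKNOWLEDGED);
        messages.insert(new BasicDBObject("vesselId", 42)
                .append("lat", 32.1)
                .append("lon", 34.8));

        client.close();
    }
}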

Alternative Solutions

● Sharded MongoDB (with replica sets)
   ○ Pros:
      ■ Increases throughput by the number of shards
      ■ Increases the availability of the DB
   ○ Cons:
      ■ Very hard to manage DevOps-wise (for a small team of developers)
      ■ High cost of servers – each shard needs 3 replicas

MongoDB + Spark

[Diagram: Spark cluster (Master + Workers 1..N) writing to and reading from a sharded MongoDB cluster, each shard backed by a replica set]

Our DevOps guy after that Solution

We had no DevOps guy at that time

Alternative Solutions

● DynamoDB (we’re hosted on Amazon)
   ○ Pros:
      ■ No need to manage DevOps
   ○ Cons:
      ■ A “Catholic wedding” with Amazon’s service – vendor lock-in
      ■ Not enough known use cases
      ■ Might reach a high cost for the service

Alternative Solutions

● Apache Cassandra
   ○ Pros:
      ■ Very large developer community
      ■ Linearly scalable database
      ■ No single-master architecture
      ■ Proven to work with distributed engines like Apache Spark
   ○ Cons:
      ■ We had no experience at all with the database
      ■ No geospatial index – we needed to implement it ourselves

The Solution

● Migration to Apache Cassandra (steps)
   ○ Writing to MongoDB and Cassandra simultaneously
   ○ Easily creating a Cassandra cluster using the DataStax Community AMI on AWS
● First easy step – using the spark-cassandra-connector
   ○ (Easy bootstrap move to Spark <=> Cassandra; see the sketch below)
● Creating a monitoring dashboard for Cassandra
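For that first step, a minimal read through the spark-cassandra-connector’s Java API might look like the sketch below; the contact point, keyspace, and table names are hypothetical:

import com.datastax.spark.connector.japi.CassandraJavaUtil;
import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ConnectorReadSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("connector-read-sketch")
                .set("spark.cassandra.connection.host", "127.0.0.1");   // hypothetical node
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each Spark partition reads directly from the Cassandra nodes that own its token range.
        JavaRDD<CassandraRow> rows = CassandraJavaUtil.javaFunctions(sc)
                .cassandraTable("tracking", "positions");               // hypothetical keyspace/table

        System.out.println("rows: " + rows.count());
        sc.stop();
    }
}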

Cassandra + Spark

[Diagram: Spark cluster (Workers 1..N) writing to and reading from a Cassandra cluster]

Cassandra + Serving

[Diagram: web services writing to and reading from the Cassandra cluster, serving the UI clients]

Result

● Performance improvement
   ○ The batch-write parts of the job run in 3 minutes instead of ~40 minutes with MongoDB
● It took 2 weeks to go from “Zero to Hero” and ramp up a running solution that works without glitches

Another Problem?

Transferring the Heaviest Process

● A microservice that runs every 10 minutes
● Writes 30GB to Cassandra per iteration
   ○ (Replication factor 3 => 90GB)
● At first it took us 18 minutes to do all of the writes
   ○ Not acceptable in a 10-minute process

Cluster on OpsCenter Before

Transferring the Heaviest Process

● Solutions
   ○ We chose the i2.xlarge instance type
   ○ Optimization of the cluster
   ○ Changing the JDK to Java 8
      ■ Changing the GC algorithm to G1
   ○ Tuning the operating system
      ■ ulimit, removing swap
   ○ Write time went down to ~5 minutes (for 30GB, RF=3)

Sounds good, right? I don’t think so

Cluster on OpsCenter After

The Solution

● Taking the same data model that we held in Cassandra (all of the raw data per 10 minutes) and putting it on S3
   ○ Write time went down from ~5 minutes to 1.5 minutes
● Added another process, not dependent on the main one, that runs every 15 minutes
   ○ Reads from S3, downscales the amount of data, and writes it to Cassandra for serving (see the sketch below)
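A rough sketch of that secondary process, assuming Spark’s Java API plus the spark-cassandra-connector; the S3 path, CSV format, sampling factor, keyspace, and table are all hypothetical, and the real downscaling is an aggregation rather than a plain sample:

import com.datastax.spark.connector.japi.CassandraJavaUtil;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.Serializable;

public class DownscaleFromS3Sketch {
    // Hypothetical bean matching a serving table (keyspace "tracking", table "serving_positions")
    public static class ServingRow implements Serializable {
        private int vesselId; private double lat; private double lon;
        public static ServingRow fromCsvLine(String line) {
            String[] parts = line.split(",");
            ServingRow row = new ServingRow();
            row.vesselId = Integer.parseInt(parts[0]);
            row.lat = Double.parseDouble(parts[1]);
            row.lon = Double.parseDouble(parts[2]);
            return row;
        }
        public int getVesselId() { return vesselId; }  public void setVesselId(int v) { vesselId = v; }
        public double getLat() { return lat; }         public void setLat(double v) { lat = v; }
        public double getLon() { return lon; }         public void setLon(double v) { lon = v; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("downscale-from-s3")
                .set("spark.cassandra.connection.host", "127.0.0.1");   // hypothetical node
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The heavy process already dumped the raw 10-minute batch to S3 (hypothetical path).
        JavaRDD<String> raw = sc.textFile("s3n://some-bucket/raw/latest/");

        // Downscale (illustrated here as a 10% sample) and write the smaller set to Cassandra.
        JavaRDD<ServingRow> downscaled = raw.sample(false, 0.1)
                .map(line -> ServingRow.fromCsvLine(line));

        CassandraJavaUtil.javaFunctions(downscaled)
                .writerBuilder("tracking", "serving_positions", CassandraJavaUtil.mapToRow(ServingRow.class))
                .saveToCassandra();

        sc.stop();
    }
}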

How does it look after all of that?

[Diagram: the heavy fusion process writes parsed raw and static/aggregated data; Spark analytics layers produce downscaled data that feeds UI serving]

Lessons Learned From a Newbie

Lessons Learned

● Use TokenAwarePolicy when connecting to the cluster – spreads the load across the coordinators

// DataStax Java driver: TokenAwarePolicy routes each request to a replica that owns
// the data, spreading the load across coordinators.
// (Contact points and socketOptions are configured elsewhere.)
Cluster cluster = Cluster.builder()
        .withSocketOptions(socketOptions)
        .withLoadBalancingPolicy(
                new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
        .build();

Lessons Learned

● Monitor everything!!! – all of the metrics

○ Cassandra

○ JVM

○ OS

● Feature-flag every parameter of the connection – you’ll need it for tuning later

Monitoring Cassandra

● Is that enough?… We can also connect it to Graphite (see the blog post “Monitoring the hell out of Cassandra”)

● Plug & play the metrics into Graphite – an internal Cassandra mechanism

● Back to the basics: dstat, iostat, iotop, jstack

Monitoring Cassandra

● Graphite + Grafana

CQL Driver

● Actions the drivers usually perform:
   ○ Open a connection
   ○ Apply actions to the database
      ■ Select, Insert, Update, Delete
   ○ Close the connection
● Do you monitor each of them?
   ○ Hint: Yes!!!! Hell yes!!!
● Creating a wrapper in any programming language and reporting the metrics (see the sketch below)
   ○ Count, execution times, errors…
   ○ Infrastructure code that will give great visibility
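A minimal sketch of such a wrapper, assuming the DataStax Java driver and the Dropwizard Metrics library; the class name and metric names are hypothetical:

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

// Hypothetical infrastructure code: every CQL request is counted, timed, and error-tracked.
public class MonitoredSession {
    private final Session session;
    private final MetricRegistry metrics;

    public MonitoredSession(Session session, MetricRegistry metrics) {
        this.session = session;
        this.metrics = metrics;
    }

    public ResultSet execute(Statement statement) {
        metrics.counter("cql.requests").inc();
        Timer.Context timed = metrics.timer("cql.latency").time();
        try {
            return session.execute(statement);
        } catch (RuntimeException e) {
            metrics.counter("cql.errors").inc();    // driver exceptions are RuntimeExceptions
            throw e;
        } finally {
            timed.stop();
        }
    }
}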

Lessons Learned

● CQL queries
   ○ Once we got to know our data model better
● It became more efficient, performance-wise, to use CQL statements instead of the “spark-cassandra-connector”
   ○ Prepared statements, delete queries (of full partitions), range queries… (sketched below)
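For illustration, a minimal prepared-statement and full-partition-delete sketch with the DataStax Java driver; the contact point, keyspace, table, and columns are hypothetical:

import java.util.Date;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class CqlStatementsSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();   // hypothetical node
        Session session = cluster.connect("tracking");                              // hypothetical keyspace

        // Prepared once, bound many times: parsing happens once and the bound values
        // carry the routing key, so TokenAwarePolicy can pick the right replica.
        PreparedStatement insert = session.prepare(
                "INSERT INTO positions (vessel_id, ts, lat, lon) VALUES (?, ?, ?, ?)");
        session.execute(insert.bind(42, new Date(), 32.1, 34.8));

        // Deleting a whole partition is a single partition-level tombstone.
        PreparedStatement deletePartition = session.prepare(
                "DELETE FROM positions WHERE vessel_id = ?");
        session.execute(deletePartition.bind(42));

        cluster.close();
    }
}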

Lessons Learned

● “nodetool” is your friend
   ○ tpstats, cfhistograms, cfstats…
● Data modeling
   ○ Time series data
   ○ Evenly distributed partitions
   ○ Everything becomes more rigid
● Know your queries before you model (see the schema sketch below)
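As one illustration of time-series modeling with evenly distributed partitions, a hypothetical schema that buckets each entity’s data by day, so no single partition grows unbounded (all names are made up):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TimeSeriesSchemaSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();   // hypothetical node
        Session session = cluster.connect("tracking");                              // hypothetical keyspace

        // Partition key = (vessel_id, day bucket): spreads the load evenly and keeps each
        // partition bounded; clustering by ts (newest first) serves time-range queries.
        session.execute(
                "CREATE TABLE IF NOT EXISTS positions_by_day ("
              + "  vessel_id int, day text, ts timestamp, lat double, lon double,"
              + "  PRIMARY KEY ((vessel_id, day), ts)"
              + ") WITH CLUSTERING ORDER BY (ts DESC)");

        cluster.close();
    }
}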

Lessons Learned

● DevCenter – by DataStax, free
● DBeaver – free & open source
   ○ Supports a wide variety of databases
● Redash – http://redash.io/
   ○ Open source: https://github.com/getredash/redash

Conclusions

● Cassandra is a great, linearly scaling distributed database
● Monitor as much as you can
   ○ Get visibility into what’s going on in the cluster
● Modeling your data correctly is the key to success
● Be ready for your next war
   ○ Cassandra performance tuning – you’ll get to that for sure

Questions?

Demi Ben-Ari
