Zero to Hero Data Pipeline: From MongoDB to Cassandra
Demi Ben-Ari - VP R&D @ Panorays


Page 1: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Zero to Hero Data Pipeline: From MongoDB to Cassandra

Demi Ben-Ari - VP R&D @ Panorays

Page 2: Zero to Hero Data Pipeline: from MongoDB to Cassandra

About Me

Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. Computer Science – Academic College of Tel-Aviv-Yaffo
● Co-Founder of the “Big Things” Big Data Community

In the past:
● Sr. Data Engineer - Windward
● Team Leader & Sr. Java Software Engineer, Missile Defense and Alert System - “Ofek” – IAF

Interested in almost every kind of technology – a true geek

Page 3: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Agenda

● Data flow with MongoDB
● The problem we had @ Windward
● Solution
● Lessons learned from a newbie
● Conclusions

Page 4: Zero to Hero Data Pipeline: from MongoDB to Cassandra

What Windward does...

Page 5: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Special in Windward’s Domain

Page 6: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Structure of the Data

● Geo Locations + Metadata

● Arriving over time

● Different types of messages being reported by satellites

● Encoded

● Might arrive later than actually transmitted

Page 7: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Data Flow Diagram

[Diagram] External Data Source → Data Pipeline (Parsed Raw → Entity Resolution Process → building insights on top of the entities) → Analytics Layers (Anomaly Detection, Trends) → Data Output Layer → UI for End Users

Page 8: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Environment Description

[Diagram] Cluster environments: Dev, Testing, Live, Staging, Production

OB1K – RESTful Java services

Page 9: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Data Pipeline flow - Use case

● Batch Apache Spark applications running every 10 - 60 minutes
● Request rate:
  ○ Bursts of ~9 million requests per batch job
  ○ Beginning of the job – Reads
    ■ Serving the data to other services and clients
  ○ End of the job - Writes

Page 10: Zero to Hero Data Pipeline: from MongoDB to Cassandra

MongoDB + Spark

[Diagram] Spark cluster (Master, Worker 1 … Worker N) reading from and writing to a MongoDB replica set

Page 11: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Spark Slave - Server Specs

● Instance type: r3.xlarge
● CPUs: 4
● RAM: 30.5GB
● Storage: ephemeral
● Amount: 10+

Page 12: Zero to Hero Data Pipeline: from MongoDB to Cassandra

MongoDB - Server Specs

● MongoDB version: 2.6.1
● Instance type: m3.xlarge (AWS)
● CPUs: 4
● RAM: 15GB
● Storage: EBS
● DB size: ~500GB
● Collection indexes: 5 (4 compound)

Page 13: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Situations & Problems

Page 14: Zero to Hero Data Pipeline: from MongoDB to Cassandra

The Problem

● Batch jobs
  ○ Should run for 5-10 minutes in total
  ○ Actually run for ~40 minutes
● Why?
  ○ ~20 minutes to write with the Java mongo driver – async, unacknowledged writes (see the sketch below)
  ○ ~20 minutes to sync the journal
  ○ Total: ~40 minutes of the DB being unavailable
  ○ No batch process response and no UI serving
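For context, a minimal sketch of what such a fire-and-forget write looks like with the legacy MongoDB Java driver (the 2.x API, matching the MongoDB 2.6 setup above); the host, collection and field names are made up:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;

// Fire-and-forget: the driver returns immediately, before the server has
// applied the write or synced its journal, so the real cost only shows up
// later as the long period of the DB being busy and unavailable.
MongoClient mongo = new MongoClient("mongo-host");                  // hypothetical host
DBCollection positions = mongo.getDB("tracking").getCollection("positions");

BasicDBObject doc = new BasicDBObject("vesselId", 42L)
        .append("lat", 32.08)
        .append("lon", 34.78);
positions.insert(doc, WriteConcern.UNACKNOWLEDGED);                 // no acknowledgment requested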

Page 15: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Alternative Solutions

● Sharded MongoDB (with replica sets)
  ○ Pros:
    ■ Increases throughput by the number of shards
    ■ Increases the availability of the DB
  ○ Cons:
    ■ Very hard to manage DevOps-wise (for a small team of developers)
    ■ High cost of servers – because each shard needs 3 replicas

Page 16: Zero to Hero Data Pipeline: from MongoDB to Cassandra

MongoDB + Spark

[Diagram] Spark cluster (Master, Worker 1 … Worker N) reading from and writing to a sharded MongoDB cluster, each shard backed by a replica set

Page 17: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Our DevOps guy after that Solution

We had no DevOps guy at that time

Page 18: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Alternative Solutions

● DynamoDB (we’re hosted on Amazon)
  ○ Pros:
    ■ No need to manage DevOps
  ○ Cons:
    ■ “Catholic wedding” to Amazon’s service – vendor lock-in
    ■ Not enough known usage and use cases
    ■ Might end up with a high cost for the service

Page 19: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Alternative Solutions

● Apache Cassandra
  ○ Pros:
    ■ Very large developer community
    ■ Linearly scalable database
    ■ No single-master architecture
    ■ Proven to work with distributed engines like Apache Spark
  ○ Cons:
    ■ We had no experience at all with the database
    ■ No geospatial index – we needed to implement it ourselves

Page 20: Zero to Hero Data Pipeline: from MongoDB to Cassandra

The Solution

● Migration to Apache Cassandra (steps)
  ○ Writing to Mongo and Cassandra simultaneously
  ○ Easily creating a Cassandra cluster using the DataStax Community AMI on AWS
● First easy step – using the spark-cassandra-connector
  ○ (Easy bootstrap move to Spark <=> Cassandra, see the sketch below)
● Creating a monitoring dashboard for Cassandra
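A minimal sketch of that bootstrap step using the spark-cassandra-connector’s Java API; the keyspace, table and POJO names (tracking.raw_positions, VesselPosition) are illustrative only:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("cassandra-bootstrap")
        .set("spark.cassandra.connection.host", "10.0.0.1");        // a Cassandra node
JavaSparkContext sc = new JavaSparkContext(conf);

// Read a whole table into an RDD of plain Java beans
JavaRDD<VesselPosition> positions = javaFunctions(sc)
        .cassandraTable("tracking", "raw_positions", mapRowTo(VesselPosition.class));

// ... regular Spark transformations on the RDD ...

// Write the result back to another table
javaFunctions(positions)
        .writerBuilder("tracking", "enriched_positions", mapToRow(VesselPosition.class))
        .saveToCassandra();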

Page 21: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Cassandra + Spark

[Diagram] Spark cluster (Worker 1 … Worker N) reading from and writing to the Cassandra cluster

Page 22: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Cassandra + Serving

[Diagram] UI clients calling web services, which read from and write to the Cassandra cluster

Page 23: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Result

● Performance improvement
  ○ The batch write parts of the job run in ~3 minutes instead of ~40 minutes in MongoDB
● Took 2 weeks to go from “Zero to Hero” and to ramp up a running solution that works without glitches

Page 24: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Another Problem?

Page 25: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Transferring the Heaviest Process

● Micro service that runs every 10 minutes
● Writes 30GB to Cassandra per iteration
  ○ (Replication factor 3 => 90GB)
● At first it took us 18 minutes to do all of the writes
  ○ Not acceptable for a 10 minute process

Page 26: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Cluster on OpsCenter Before

Page 27: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Transferring the Heaviest Process

● Solutions
  ○ We chose the i2.xlarge instance type
  ○ Optimization of the cluster
  ○ Changing the JDK to Java 8
    ■ Changing the GC algorithm to G1
  ○ Tuning the operating system
    ■ ulimit, removing the swap
  ○ Write time went down to ~5 minutes (for 30GB, RF=3)

Sounds good, right? I don’t think so

Page 28: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Cluster on OpsCenter Before

Page 29: Zero to Hero Data Pipeline: from MongoDB to Cassandra

The Solution

● Taking the same data model that we held in Cassandra (all of the raw data per 10 minutes) and putting it on S3
  ○ Write time went down from ~5 minutes to 1.5 minutes
● Added another process, not dependent on the main one, that runs every 15 minutes
  ○ It reads from S3, downscales the amount of data and writes it to Cassandra for serving

Page 30: Zero to Hero Data Pipeline: from MongoDB to Cassandra

How does it look after all?

[Diagram] Parsed Raw and Static / Aggregated Data feed the Spark Analytics Layers; the Heavy Fusion Process produces Downscaled Data used for UI Serving

Page 31: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Lessons Learned From a Newbie

Page 32: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Lessons Learned

● Use TokenAwarePolicy when connecting to the cluster – it spreads the load across the coordinators

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

// TokenAwarePolicy routes each request to a replica that owns the partition,
// wrapping DCAwareRoundRobinPolicy so the load is spread across coordinators.
Cluster cluster = Cluster.builder()
        .withSocketOptions(socketOptions)
        .withLoadBalancingPolicy(
                new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
        .build();

Page 33: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Lessons Learned

● Monitor everything!!! – All of the metrics

○ Cassandra

○ JVM

○ OS

● Feature-flag every parameter of the connection – you’ll need it for tuning later (see the sketch below)
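A minimal sketch of what that can look like with the DataStax Java driver, where every knob is read from a system property with a default; the property names are made up:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.SocketOptions;

// Each connection parameter comes from configuration, so tuning is a
// config change instead of a code change and redeploy.
SocketOptions socketOptions = new SocketOptions()
        .setConnectTimeoutMillis(Integer.getInteger("cass.connect.timeout.ms", 5000))
        .setReadTimeoutMillis(Integer.getInteger("cass.read.timeout.ms", 12000));

PoolingOptions poolingOptions = new PoolingOptions()
        .setMaxConnectionsPerHost(HostDistance.LOCAL,
                Integer.getInteger("cass.max.connections.local", 8));

Cluster cluster = Cluster.builder()
        .addContactPoint(System.getProperty("cass.contact.point", "10.0.0.1"))
        .withSocketOptions(socketOptions)
        .withPoolingOptions(poolingOptions)
        .build();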

Page 34: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Monitoring Cassandra

Page 36: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Monitoring Cassandra

● Is that enough?... We can also connect it to Graphite (blog post: “Monitoring the hell out of Cassandra”)

● Plug & Play the metrics to Graphite - Internal Cassandra mechanism

● Back to the basics: dstat, iostat, iotop, jstack

Page 37: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Monitoring Cassandra

Page 38: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Monitoring Cassandra

● Graphite + Grafana

Page 39: Zero to Hero Data Pipeline: from MongoDB to Cassandra

CQL Driver

● Actions the drivers usually do:
  ○ Open a connection
  ○ Apply actions on the database
    ■ Select, Insert, Update, Delete
  ○ Close the connection
● Do you monitor each of them?
  ○ Hint: Yes!!!! Hell Yes!!!
● Create a wrapper in any programming language and report the metrics (see the sketch below)
  ○ Count, execution times, errors…
  ○ Infrastructure code that will give great visibility
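A minimal sketch of such a wrapper in Java, using the Dropwizard Metrics library around the DataStax driver’s Session; the class and metric names are made up:

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

// Thin wrapper around the driver Session: every query is counted and timed,
// and failures are counted separately, so dashboards show exactly what the
// services are doing against the cluster.
public class InstrumentedSession {
    private final Session session;
    private final MetricRegistry metrics;

    public InstrumentedSession(Session session, MetricRegistry metrics) {
        this.session = session;
        this.metrics = metrics;
    }

    public ResultSet execute(Statement statement) {
        Timer.Context time = metrics.timer("cassandra.execute.time").time();
        try {
            return session.execute(statement);
        } catch (RuntimeException e) {
            metrics.counter("cassandra.execute.errors").inc();
            throw e;
        } finally {
            time.stop();
        }
    }
}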

Page 40: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Lessons Learned

● CQL queries
  ○ Once we got to know our data model better, it became more efficient performance-wise to use CQL statements instead of the spark-cassandra-connector
  ○ Prepared statements, delete queries (of full partitions), range queries… (see the sketch below)
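A minimal sketch of those statement types with the DataStax Java driver, assuming an open Session named session and a hypothetical tracking.raw_positions table whose partition key is (vessel_id, day) and whose clustering column is reported_at (the same made-up model sketched in the data-modeling example further below):

import java.util.Date;

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;

// Prepared once, bound per execution – the cluster does not re-parse the CQL.
PreparedStatement deletePartition = session.prepare(
        "DELETE FROM tracking.raw_positions WHERE vessel_id = ? AND day = ?");
PreparedStatement rangeQuery = session.prepare(
        "SELECT * FROM tracking.raw_positions "
      + "WHERE vessel_id = ? AND day = ? AND reported_at >= ? AND reported_at < ?");

long vesselId = 42L;                                   // example partition key values
String day = "2016-03-01";
Date to = new Date();
Date from = new Date(to.getTime() - 3600_000L);        // last hour

// Delete a full partition in a single statement
session.execute(deletePartition.bind(vesselId, day));

// Range query inside a single partition, bounded by the clustering column
ResultSet lastHour = session.execute(rangeQuery.bind(vesselId, day, from, to));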

Page 41: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Lessons Learned

● “nodetool” is your friend
  ○ tpstats, cfhistograms, cfstats…
● Data modeling
  ○ Time series data
  ○ Evenly distributed partitions
  ○ Everything becomes more rigid
● Know your queries before you model (a sketch of such a model follows)
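A made-up example of such a time-series model, created through the Java driver (assuming an open Session named session): the partition key combines the entity id with a day bucket so partitions stay bounded and writes spread evenly across the cluster, and rows are clustered by report time so range queries stay inside one partition:

// Hypothetical table: partition key = (vessel_id, day bucket),
// clustering column = reported_at, newest rows first.
session.execute(
        "CREATE TABLE IF NOT EXISTS tracking.raw_positions ("
      + "  vessel_id   bigint,"
      + "  day         text,"             // e.g. '2016-03-01' – keeps partitions bounded
      + "  reported_at timestamp,"
      + "  lat         double,"
      + "  lon         double,"
      + "  PRIMARY KEY ((vessel_id, day), reported_at)"
      + ") WITH CLUSTERING ORDER BY (reported_at DESC)");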

Page 42: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Lessons Learned

● DevCenter – by DataStax - free
● DBeaver – free & open source
  ○ Supports a wide variety of databases
● Redash - http://redash.io/
  ○ Open source: https://github.com/getredash/redash

Page 43: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Conclusions

● Cassandra is a great, linearly scaling distributed database

● Monitor as much as you can

  ○ Get visibility into what’s going on in the cluster

● Modeling the data correctly is the key to success

● Be ready for your next war

  ○ Cassandra performance tuning – you’ll get to that for sure

Page 44: Zero to Hero Data Pipeline: from MongoDB to Cassandra

Questions?

Demi Ben-Ari
