Zero to Hero Data Pipeline: From MongoDB to Cassandra
Demi Ben-Ari - VP R&D @ Panorays
About Me
Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. Computer Science – Academic College of Tel-Aviv–Yaffo
● Co-Founder of the “Big Things” Big Data Community

In the past:
● Sr. Data Engineer - Windward
● Team Leader & Sr. Java Software Engineer, Missile Defense and Alert System - “Ofek” – IAF

Interested in almost every kind of technology – a true geek
Agenda
● Data flow with MongoDB
● The problem we had @ Windward
● Solution
● Lessons learned from a newbie
● Conclusions
What Windward does...
What’s Special in Windward’s Domain
Structure of the Data
● Geo Locations + Metadata
● Arriving over time
● Different types of messages being reported by satellites
● Encoded
● Might arrive later than actually transmitted
Data Flow Diagram
[Diagram: External Data Source → Data Pipeline (Parsed Raw) → Entity Resolution Process → Analytics Layers (building insights on top of the entities: Anomaly Detection, Trends) → Data Output Layer → UI for End Users]
Environment Description
[Diagram: Dev, Testing, Live, Staging and Production environments, each with its own cluster, behind OB1K RESTful Java services]
Data Pipeline flow - Use case
● Batch Apache Spark applications running every 10-60 minutes
● Request rate:
   ○ Bursts of ~9 million requests per batch job
   ○ Beginning – Reads
      ■ Serving the data to other services and clients
   ○ End – Writes
MongoDB + Spark
[Diagram: Spark cluster (Master + Workers 1..N) reading from and writing to a MongoDB replica set]
Spark Slave - Server Specs
● Instance type: r3.xlarge (AWS)
● CPUs: 4
● RAM: 30.5GB
● Storage: ephemeral
● Amount: 10+
MongoDB - Server Specs
● MongoDB version: 2.6.1
● Instance type: m3.xlarge (AWS)
● CPUs: 4
● RAM: 15GB
● Storage: EBS
● DB size: ~500GB
● Collection indexes: 5 (4 compound)
Situations & Problems
The Problem
● Batch jobs
   ○ Should run for 5-10 minutes in total
   ○ Actual: runs for ~40 minutes
● Why?
   ○ ~20 minutes to write with the Java Mongo driver – async (Unacknowledged; sketched below)
   ○ ~20 minutes to sync the journal
   ○ Total: ~40 minutes of the DB being unavailable
   ○ No batch process response and no UI serving
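To make that failure mode concrete, here is a minimal sketch of an unacknowledged write with the legacy MongoDB Java driver; the host, database, collection and field names are placeholders, not Windward’s actual code:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;

MongoClient client = new MongoClient("mongo-host");  // placeholder host
DBCollection positions = client.getDB("tracking").getCollection("positions");

// UNACKNOWLEDGED ("fire and forget") returns immediately, so the batch job
// queues writes fast -- but the server keeps flushing data and syncing the
// journal long after, which is where the ~40 minutes of unavailability came from
positions.setWriteConcern(WriteConcern.UNACKNOWLEDGED);
positions.insert(new BasicDBObject("shipId", 42).append("lat", 32.1).append("lon", 34.8));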
Alternative Solutions
● Sharded MongoDB (with replica sets)
   ○ Pros:
      ■ Increases throughput by the number of shards
      ■ Increases the availability of the DB
   ○ Cons:
      ■ Very hard to manage DevOps-wise (for a small team of developers)
      ■ High server cost – each shard needs 3 replicas
MongoDB + Spark
[Diagram: Spark cluster (Master + Workers 1..N) reading from and writing to a sharded MongoDB, each shard a replica set]
Our DevOps guy after that solution...

We had no DevOps guy at that time
Alternative Solutions
● DynamoDB (we’re hosted on Amazon)
   ○ Pros:
      ■ No need to manage DevOps
   ○ Cons:
      ■ A “Catholic wedding” with Amazon’s service – total vendor lock-in
      ■ Not enough known use cases
      ■ Might get to a high cost for the service
Alternative Solutions
● Apache Cassandra
   ○ Pros:
      ■ Very large developer community
      ■ Linearly scalable database
      ■ No single-master architecture
      ■ Proven to work with distributed engines like Apache Spark
   ○ Cons:
      ■ We had no experience at all with the database
      ■ No geospatial index – we needed to implement one ourselves (see the sketch below)
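For illustration only: one common workaround for the missing geospatial index is to bucket coordinates into fixed-size grid cells and use the cell as the partition key. This is a generic technique, not Windward’s actual implementation, and the cell size and key format are arbitrary assumptions:

public final class GeoCell {
    private static final double CELL_DEGREES = 0.5;  // assumed cell size, ~55km at the equator

    // Partition key: all points in the same 0.5-degree cell share a partition,
    // so a bounding-box query becomes a small set of partition reads
    public static String cellOf(double lat, double lon) {
        long x = (long) Math.floor((lon + 180.0) / CELL_DEGREES);
        long y = (long) Math.floor((lat + 90.0) / CELL_DEGREES);
        return x + ":" + y;
    }
}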
The Solution
● Migration to Apache Cassandra (steps)
   ○ Write to Mongo and Cassandra simultaneously
   ○ Easily create a Cassandra cluster using the DataStax Community AMI on AWS
● First easy step – using the spark-cassandra-connector
   ○ (Easy bootstrap move to Spark <=> Cassandra; see the sketch below)
● Creating a monitoring dashboard for Cassandra
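A minimal sketch of that bootstrap step with the spark-cassandra-connector’s Java API; the contact point, keyspace and table names are made up for the example:

import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

SparkConf conf = new SparkConf()
        .setAppName("cassandra-bootstrap")
        .set("spark.cassandra.connection.host", "10.0.0.1");  // placeholder host
JavaSparkContext sc = new JavaSparkContext(conf);

// Read a whole table as an RDD of rows (keyspace/table are illustrative);
// writes go the other way via javaFunctions(rdd).writerBuilder(...).saveToCassandra()
JavaRDD<CassandraRow> rows = javaFunctions(sc).cassandraTable("tracking", "positions");
System.out.println("rows: " + rows.count());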
Cassandra + Spark
[Diagram: Spark cluster (Workers 1..N) reading from and writing to a Cassandra cluster]
Cassandra + Serving
[Diagram: web services reading from and writing to the Cassandra cluster, serving multiple UI clients]
Result
● Performance improvement
   ○ The batch-write parts of the job run in 3 minutes instead of ~40 minutes in MongoDB
● Took 2 weeks to go from “Zero to Hero” and to ramp up a running solution that works without glitches
Another Problem?
Transferring the Heaviest Process
● Micro-service that runs every 10 minutes
● Writes 30GB to Cassandra per iteration
   ○ (Replication factor 3 => 90GB)
● At first it took us 18 minutes to do all of the writes
   ○ Not acceptable in a 10-minute process
Cluster on OpsCenter – Before [screenshot]
Transferring the Heaviest Process
● Solutions
   ○ We chose the i2.xlarge instance type
   ○ Optimization of the cluster
   ○ Changing the JDK to Java 8
      ■ Changing the GC algorithm to G1
   ○ Tuning the operating system
      ■ ulimit, removing the swap
   ○ Write time went down to ~5 minutes (for 30GB, RF=3) – representative settings sketched below
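Representative settings for those tweaks; the values are illustrative, not the exact ones used:

# cassandra-env.sh – run on Java 8 with the G1 collector
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"

# /etc/security/limits.conf – raise limits for the cassandra user
# cassandra - memlock unlimited
# cassandra - nofile  100000

# Disable the swap (and drop its entry from /etc/fstab)
sudo swapoff -a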
Sounds good, right? I don’t think so...
Cluster on OpsCenter – After [screenshot]
The Solution
● Taking the same data model that we held in Cassandra (all of the raw data per 10 minutes) and putting it on S3
   ○ Write time went down from ~5 minutes to 1.5 minutes
● Added another process, not dependent on the main one, that runs every 15 minutes
   ○ Reads from S3, downscales the data and writes it to Cassandra for serving (sketched below)
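A rough sketch of the two decoupled jobs, with an invented bucket name and a trivial stand-in for the real downscaling logic:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Heavy job (every 10 minutes): dump the raw batch to S3 instead of Cassandra
void writeRawBatch(JavaRDD<String> rawBatch, String batchId) {
    rawBatch.saveAsTextFile("s3n://pipeline-raw/" + batchId);  // invented bucket/path
}

// Independent job (every 15 minutes): read from S3, downscale, serve from Cassandra
void downscaleAndServe(JavaSparkContext sc, String batchId) {
    JavaRDD<String> raw = sc.textFile("s3n://pipeline-raw/" + batchId);
    JavaRDD<String> downscaled = raw.sample(false, 0.1);  // stand-in for the real downscaling
    // ...then write `downscaled` to the serving table, e.g. with the connector's
    // javaFunctions(downscaled).writerBuilder(...).saveToCassandra()
}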
How does it look after all?

[Diagram: Heavy Fusion Process → Parsed Raw data on S3 → Spark Analytics Layers → Static / Aggregated and Downscaled Data → UI Serving]
Lessons Learned From a Newbie
Lessons Learned
● Use TokenAwarePolicy when connecting to the cluster – it spreads the load across the coordinators

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

SocketOptions socketOptions = new SocketOptions();
Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")  // placeholder contact point
        .withSocketOptions(socketOptions)
        // The token-aware wrapper routes each statement to a replica that
        // owns the data, avoiding the extra coordinator hop
        .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
        .build();
Lessons Learned
● Monitor everything!!! – all of the metrics
   ○ Cassandra
   ○ JVM
   ○ OS
● Feature-flag every parameter of the connection – you’ll need it for tuning later (see the sketch below)
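For example, a sketch of pulling every connection tunable from an external property with a sane default, so it can be changed per environment without a rebuild; the property names are invented:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

SocketOptions socketOptions = new SocketOptions()
        // Each tunable reads a system property and falls back to a default
        .setConnectTimeoutMillis(Integer.getInteger("cassandra.connectTimeoutMs", 5000))
        .setReadTimeoutMillis(Integer.getInteger("cassandra.readTimeoutMs", 12000));

Cluster cluster = Cluster.builder()
        .addContactPoint(System.getProperty("cassandra.host", "127.0.0.1"))
        .withSocketOptions(socketOptions)
        .build();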
Monitoring Cassandra
● OpsCenter - by DataStax
Monitoring Cassandra
● Is that enough?...
● We can also connect it to Graphite (blog post: “Monitoring the hell out of Cassandra”)
● Plug & play the metrics into Graphite – an internal Cassandra mechanism
● Back to the basics: dstat, iostat, iotop, jstack
Monitoring Cassandra
● Graphite + Grafana
CQL Driver
● Actions they usually do:
   ○ Open a connection
   ○ Apply actions on the database
      ■ Select, Insert, Update, Delete
   ○ Close the connection
● Do you monitor each one?
   ○ Hint: Yes!!!! Hell yes!!!
● Create a wrapper in any programming language and report the metrics
   ○ Count, execution times, errors…
   ○ Infrastructure code that will give great visibility (see the sketch below)
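A minimal sketch of such a wrapper, assuming the Dropwizard Metrics library for reporting; the class and metric names are invented:

import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

public class MonitoredSession {
    private final Session session;
    private final Timer executeTimer;
    private final Counter errors;

    public MonitoredSession(Session session, MetricRegistry registry) {
        this.session = session;
        this.executeTimer = registry.timer("cassandra.execute.time");
        this.errors = registry.counter("cassandra.execute.errors");
    }

    // Every query goes through here, so count, latency and error metrics
    // are reported for free by the infrastructure code
    public ResultSet execute(Statement statement) {
        try (Timer.Context ignored = executeTimer.time()) {
            return session.execute(statement);
        } catch (RuntimeException e) {
            errors.inc();
            throw e;
        }
    }
}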
Lessons Learned
● CQL queries
   ○ Once we got to know our data model better, it became more efficient performance-wise to use CQL statements instead of the “spark-cassandra-connector”
   ○ Prepared statements, delete queries (of full partitions), range queries… (examples below)
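Illustrative examples against an assumed table tracking.positions with PRIMARY KEY ((ship_id, day), ts) (the schema sketch appears on the data-modeling slide below); session, shipId, day, from and to are assumed to be in scope:

import com.datastax.driver.core.PreparedStatement;

// Prepared once, bound per call – cheaper than re-parsing the CQL each time
PreparedStatement rangeQuery = session.prepare(
    "SELECT * FROM tracking.positions WHERE ship_id = ? AND day = ? AND ts >= ? AND ts < ?");
session.execute(rangeQuery.bind(shipId, day, from, to));

// Deleting a full partition is a single cheap tombstone, not a row-by-row scan
PreparedStatement deletePartition = session.prepare(
    "DELETE FROM tracking.positions WHERE ship_id = ? AND day = ?");
session.execute(deletePartition.bind(shipId, day));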
Lessons Learned
● “nodetool” is your friend
   ○ tpstats, cfhistograms, cfstats…
● Data modeling
   ○ Time-series data
   ○ Evenly distributed partitions
   ○ Everything becomes more rigid
● Know your queries before you model (an illustrative schema follows)
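An illustrative time-series model in CQL, not Windward’s actual schema: bucketing the partition by ship and day keeps partitions evenly sized, and clustering by timestamp makes range queries a single sequential read:

CREATE TABLE tracking.positions (
    ship_id bigint,
    day     text,       -- e.g. '2015-06-01'; bounds the partition size
    ts      timestamp,
    lat     double,
    lon     double,
    PRIMARY KEY ((ship_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);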
Lessons Learned
● DevCenter – by DataStax – free
● DBeaver – free & open source
   ○ Supports a wide variety of databases
● Redash – http://redash.io/
   ○ Open source: https://github.com/getredash/redash
Conclusions
● Cassandra is a great, linearly scaling distributed database
● Monitor as much as you can
   ○ Get visibility into what’s going on in the cluster
● Modeling your data correctly is the key to success
● Be ready for your next war
   ○ Cassandra performance tuning – you’ll get to that for sure
Questions?
Demi Ben-Ari
● LinkedIn
● Twitter: @demibenari
● Blog: http://progexc.blogspot.com/
● Email: [email protected]
● “Big Things” Community: Meetup, YouTube, Facebook, Twitter
● GDG Cloud