Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays


Quick dive into the Big Data pool without drowning

Demi Ben-Ari - VP R&D @ Panorays

About Me

Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
● Co-Founder of the “Big Things” Big Data Community

In the Past:
● Sr. Data Engineer - Windward
● Team Leader & Sr. Java Software Engineer, Missile Defense and Alert System - “Ofek” – IAF

Interested in almost every kind of technology – A True Geek

Agenda

● Basic Concepts

● Introduction to Big Data frameworks

● Distributed Systems => Problems

● Monitoring

● Conclusions

Say “Distributed”, Say “Big Data”, Say….

Some basic concepts

What is Big Data (IMHO)?

● Systems involving the “3 Vs”:

What are the right questions we want to ask?

○ Volume - How much?

○ Velocity - How fast?

○ Variety - What kind? (Difference)

What is Big Data (IMHO)

● Some define it by the “7 Vs”, adding:
○ Variability (constantly changing)
○ Veracity (accuracy)
○ Visualization
○ Value

What is Big Data (IMHO)

● Characteristics
○ Multi-region availability
○ Very fast and reliable response
○ No single point of failure

Why Not Relational Data

● The relational model provides
○ Normalized table schema
○ Cross-table joins
○ ACID compliance (Atomicity, Consistency, Isolation, Durability)

● But at a very high cost
○ Big Data table joins - billions of rows - massive overhead
○ Sharding tables across systems is complex and fragile

● Modern applications have different priorities
○ The need for speed and availability outweighs consistency
○ Racks of commodity servers trump massive high-end systems
○ The real-world need for transactional guarantees is limited

What strategies help manage Big Data?

● Distribute data across nodes
○ Replication
● Relax consistency requirements
● Relax schema requirements
● Optimize data to suit actual needs

What is the NoSQL landscape?

● 4 broad classes of non-relational databases (DB-Engines)
○ Graph: data elements each relate to N others in a graph / network
○ Key-Value: keys map to arbitrary values of any data type
○ Document: document sets (JSON) queryable in whole or in part
○ Wide Column Store (Column Family): keys mapped to sets of N typed columns
● Three key factors help understand the subject
○ Consistency: Do you get identical results, regardless of which node is queried?
○ Availability: Can it respond to very high read and write volumes?
○ Partition tolerance: Is it still available when part of it is down?

What is the CAP theorem?
● In distributed systems, consistency, availability and partition tolerance exist in a mutually dependent relationship - pick any two.

[CAP triangle diagram]
● CA (Consistency + Availability): RDBMS - MySQL, PostgreSQL, Greenplum, Vertica; Graph - Neo4J
● AP (Availability + Partition tolerance): Key-Value / Wide Column - Cassandra, DynamoDB, Riak, CouchDB, Voldemort
● CP (Consistency + Partition tolerance): HBase, MongoDB, Redis, BigTable, BerkeleyDB

DB Engines - Comparison
● http://db-engines.com/en/ranking


What does DevOps really mean?

Development: Software Engineering, UX

Operations: System Admin, Database Admin

What does DevOps really mean?

DevOps: Cross-functional teams
● Operators automating systems
● Developers operating systems

Introduction to Big Data Frameworks

https://d152j5tfobgaot.cloudfront.net/wp-content/uploads/2015/02/yourstory_BigData.jpg

Characteristics of Hadoop

● A system to process very large amounts of unstructured and complex data at the desired speed

● A system that runs on a large number of machines that don’t share any memory or disk

● A system that runs on a cluster of machines which can be put together at relatively low cost and maintained easily

Hadoop Principles
● “A system that moves the computation to where the data is”

● Key Concepts of Hadoop

○ Flexibility
○ Scalability
○ Low cost
○ Fault tolerance

Hadoop Core Components

● HDFS - Hadoop Distributed File System
○ Provides a distributed data storage system that stores data in smaller blocks in a fail-safe manner
● MapReduce - Programming framework
○ Takes a query over a dataset, divides it, and runs it in parallel on multiple nodes (see the sketch after this list)
● YARN - (Yet Another Resource Negotiator) MRv2
○ Splits the MapReduce JobTracker’s responsibilities into:
■ Resource Manager (global)
■ Application Master (per application)
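To make the MapReduce idea concrete, here is a minimal word-count sketch in the Hadoop Streaming style; the script name and the local shell pipeline below are assumptions for illustration, not part of the original deck.

```python
#!/usr/bin/env python3
"""Word count in the Hadoop Streaming style (illustrative sketch)."""
import sys


def mapper(lines):
    # Map phase: emit one tab-separated (word, 1) pair per word.
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"


def reducer(lines):
    # Reduce phase: input arrives sorted by key, so equal words are adjacent.
    current, count = None, 0
    for line in lines:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"


if __name__ == "__main__":
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    for record in (mapper if stage == "map" else reducer)(sys.stdin):
        print(record)
```

Locally this can be exercised as `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`; Hadoop Streaming applies the same map → shuffle/sort → reduce pattern, but moves the mapper to the nodes holding the data blocks.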

Hadoop Ecosystem

Hadoop Core: HDFS, MapReduce / YARN, Hadoop Common

Hadoop Applications: Hive, Pig, HBase, Oozie, ZooKeeper, Sqoop, Spark

Hadoop (+Spark) Distributions

AWS Elastic MapReduce (EMR), Google Cloud Dataproc

New Age BI Applications

● Able to understand various types of data
● Ability to clean the data
● Process data with applied rules, locally and in a distributed environment
● Visualize sizeable data with speed
● Extend results by sharing within the enterprise

Big Data Analytics

● Processing large amounts of data without data movement
● Avoid data connectors if possible (run natively)
● Ability to understand a vast number of data types and data compressions
● Ability to process data on a variety of processing frameworks
● Distributed data processing
○ In-Memory is a big plus
● Super fast visualization
○ In-Memory is a big plus

When to choose Hadoop?

● Large volumes of data to store and process
● Semi-structured or unstructured data
● Data is not well categorized
● Data contains a lot of redundancy
● Data arrives in streams or large batches
● Complex batch jobs arriving in parallel
● You don’t know how the data might be useful

Distributed Systems => Problems

https://imgflip.com/i/1ap5kr
http://kingofwallpapers.com/otter/otter-004.jpg

Monolith Structure

[Diagram: a single stack - OS, CPU, Memory, Disk; processes: Java Application Server, Database, Web Server, Load Balancer, Monitoring System, UI; serving users and other applications]

Many times...all of this was on a single physical server!

Distributed Microservices Architecture

[Diagram: Service A, Service B and Service C, each with its own DB and cache, connected through a queue; a Web Server with its DB; an Analytics Cluster (Master + Slaves); and a big question mark over the Monitoring System???]

MongoDB + Spark

[Diagram: a Spark Cluster (Master + Worker 1 … Worker N) writing to and reading from a sharded MongoDB (replica sets)]

Cassandra + Spark

[Diagram: a Spark Cluster (Worker 1 … Worker N) writing to and reading from a Cassandra Cluster - see the code sketch below]
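A rough sketch of the read/write path in the diagram above, assuming the DataStax spark-cassandra-connector is used from PySpark; the connector version, host, keyspace and table names are placeholders, not part of the original deck.

```python
"""Sketch: reading from / writing to Cassandra from a Spark job."""
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("cassandra-spark-sketch")
    # Assumed connector coordinates; match the version to your Spark/Scala build.
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
    .config("spark.cassandra.connection.host", "cassandra-node-1")  # placeholder host
    .getOrCreate()
)

# Read a Cassandra table into a DataFrame (placeholder keyspace/table).
events = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="events_ks", table="raw_events")
    .load()
)

# A small aggregation that runs on the Spark cluster, not inside Cassandra.
daily = events.groupBy("event_date").agg(F.count("*").alias("events"))

# Write the result back to another (placeholder) Cassandra table.
(
    daily.write.format("org.apache.spark.sql.cassandra")
    .options(keyspace="events_ks", table="daily_counts")
    .mode("append")
    .save()
)
```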

Cassandra + Serving

[Diagram: multiple Web Services writing to and reading from the Cassandra Cluster, serving multiple UI Clients]

Problems

● Multiple physical servers

● Multiple logical services

● Want Scaling => More Servers

● Even if you had all of the metrics
○ You’ll have an overflow of data

● Your monitoring becomes a “Big Data” problem itself

This is what “Distributed” really means

The DevOps Guy

(It might be you)

Monitoring is Crucial

http://memeguy.com/photo/46871/you-are-being-monitored

Monitoring Operating System Metrics

Some help from “the Cloud”

AWS’s CloudWatch / GCP StackDriver

Report to Where?

● We chose: Graphite (InfluxDB) + Grafana
● Can correlate system and application metrics in one place :)
(see the metric-push sketch below)
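A minimal sketch of pushing an application metric into Graphite over Carbon’s plaintext protocol, so it can sit next to the system metrics in Grafana; the host and metric path below are placeholders.

```python
"""Sketch: pushing an application metric to Graphite (illustrative)."""
import socket
import time

CARBON_HOST = "graphite.internal"  # placeholder host
CARBON_PORT = 2003                 # Carbon plaintext listener default


def send_metric(path, value, timestamp=None):
    # Carbon's plaintext protocol: "<metric.path> <value> <unix-timestamp>\n"
    timestamp = int(timestamp if timestamp is not None else time.time())
    line = "{} {} {}\n".format(path, value, timestamp)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))


if __name__ == "__main__":
    # An application-level metric (placeholder name) next to the system metrics.
    send_metric("myapp.ingest.docs_processed", 1234)
```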

Monitoring Cassandra


Monitoring Spark

Ways to Monitor Spark

● Grafana-spark-dashboards

○ Blog: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/

● Spark UI - online, on each running application
● Spark History Server - offline (after the application finishes)
● Spark REST API
○ Querying via in-house tools to do ad-hoc monitoring (see the sketch below)
● Back to the basics: dstat, iostat, iotop, jstack
● Blog post by Tzach Zohar - “Tips from the Trenches”
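As an example of the ad-hoc REST API querying mentioned above, a small sketch against Spark’s monitoring API (served by the driver UI on port 4040, or by the History Server for finished applications); the host below is a placeholder.

```python
"""Sketch: ad-hoc monitoring via Spark's REST API."""
import requests

BASE = "http://spark-driver.internal:4040/api/v1"  # placeholder driver host

# List running applications, then count active stages per application.
apps = requests.get(f"{BASE}/applications", timeout=10).json()
for app in apps:
    app_id = app["id"]
    stages = requests.get(f"{BASE}/applications/{app_id}/stages", timeout=10).json()
    active = [s for s in stages if s.get("status") == "ACTIVE"]
    print(f"{app['name']} ({app_id}): {len(active)} active stage(s)")
```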

Monitoring Your Data

https://memegenerator.net/instance/53617544

Data Questions - What should we measure?

● Did all of the computation occur?

○ Are there any data layers missing?

● How much data do we have? (Volume)

● Is all of the data in the Database?

● Data Quality Assurance

Data Answers!
● The method doesn’t really matter, as long as you:
○ Can follow the results over time (see the count-check sketch after this list)
○ Know your data flow and know what might fail
○ Make it easy for anyone to add more monitoring (for the ones adding the new data each time…)
○ Don’t trust others to add monitoring
(It will always end up as the DevOps’s “fault” -> no monitoring will be applied)
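A hedged sketch of turning the “did all of the computation occur / how much data do we have” questions into a daily check; count_documents_for() is a hypothetical stand-in for whatever query your datastore needs, and the threshold and metric are placeholders.

```python
"""Sketch: a daily data-completeness check (illustrative)."""
import datetime
import logging

EXPECTED_MIN_DOCS = 100_000  # placeholder threshold per day


def count_documents_for(day):
    """Hypothetical helper: run a count query against your datastore for that day."""
    raise NotImplementedError


def check_yesterday():
    day = datetime.date.today() - datetime.timedelta(days=1)
    actual = count_documents_for(day)
    # Push the count as a metric (e.g. via send_metric from the Graphite sketch)
    # so the trend is visible over time, then alert on the threshold.
    if actual < EXPECTED_MIN_DOCS:
        logging.error("Data layer incomplete for %s: %d < %d",
                      day, actual, EXPECTED_MIN_DOCS)
    return actual
```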

ELK - Elasticsearch + Logstash + Kibana

http://www.digitalgov.gov/2014/05/07/analyzing-search-data-in-real-time-to-drive-decisions/
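To show the shape of the data flowing through ELK, a minimal sketch that indexes one structured log event straight into Elasticsearch via its REST API; in a real setup Logstash or a Beats shipper usually does this, and the URL, index name and fields here are placeholders.

```python
"""Sketch: indexing a structured log event into Elasticsearch (illustrative)."""
import datetime
import requests

ES_URL = "http://elasticsearch.internal:9200"  # placeholder URL
INDEX = "app-logs"                             # placeholder index name

event = {
    "@timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    "level": "ERROR",
    "service": "ingest-worker",
    "message": "failed to parse document",
}

# POST /<index>/_doc creates a document with an auto-generated id.
resp = requests.post(f"{ES_URL}/{INDEX}/_doc", json=event, timeout=10)
resp.raise_for_status()
```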

Monitoring Stack

● Alerting
● Metrics Collection
● Datastore
● Dashboard
● Data Monitoring
● Log Monitoring

Big Data - Are we there yet?

● “3 Vs” - What are the right questions we want to ask?

○ Volume - How much?

■ Can it run on a single machine in reasonable time?

○ Velocity - How fast?

■ Can a single machine handle the throughput?

○ Variety - What kind? (Difference)

■ Is your data uniform, not really changing or varying?

● If the answer to most of the questions above is “Yes” - think again before adding the complexity of “Big Data”

Conclusions

● Think carefully before going into the “Big Data pool”
○ See if you really have a problem that you’re trying to solve
○ It’s not a silver bullet

● Take measures to automate and monitor everything
● Having clusters and distributed frameworks will cost a lot - eventually
● Fit your storage layer(s) to the needs

Questions?

https://www.stayathomemum.com.au/wp-content/uploads/2015/01/DDDDDD.jpg

Still feel like you’re drowning?