Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays


Quick dive into the Big Data pool without drowning

Demi Ben-Ari - VP R&D @ Panorays

About Me

Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
● Co-Founder of the “Big Things” Big Data Community

In the Past:
● Sr. Data Engineer - Windward
● Team Leader & Sr. Java Software Engineer, Missile Defense and Alert System - “Ofek” – IAF

Interested in almost every kind of technology – A True Geek

Agenda

● Basic Concepts

● Introduction to Big Data frameworks

● Distributed Systems => Problems

● Monitoring

● Conclusions

Say “Distributed”, Say “Big Data”, Say….

Some basic concepts

What is Big Data (IMHO)?

● Systems involving the “3 Vs”:

What are the right questions we want to ask?

○ Volume - How much?

○ Velocity - How fast?

○ Variety - What kind? (Difference)

What is Big Data (IMHO)

● Some define it by the “7 Vs”, adding:
○ Variability (constantly changing)
○ Veracity (accuracy)
○ Visualization
○ Value

What is Big Data (IMHO)

● Characteristics
○ Multi-region availability
○ Very fast and reliable response
○ No single point of failure

Why Not Relational Data

● The relational model provides
○ Normalized table schema
○ Cross-table joins
○ ACID compliance (Atomicity, Consistency, Isolation, Durability)

● But at a very high cost
○ Big Data table joins - billions of rows - massive overhead
○ Sharding tables across systems is complex and fragile

● Modern applications have different priorities
○ The need for speed and availability outweighs consistency
○ Racks of commodity servers trump massive high-end systems
○ The real-world need for transactional guarantees is limited

What strategies help manage Big Data?

● Distribute data across nodes
○ Replication
● Relax consistency requirements
● Relax schema requirements
● Optimize data to suit actual needs

What is the NoSQL landscape?

● 4 broad classes of non-relational databases (DB-Engines)
○ Graph: data elements each relate to N others in a graph / network
○ Key-Value: keys map to arbitrary values of any data type
○ Document: document sets (JSON) queryable in whole or in part
○ Wide Column Store (Column Family): keys mapped to sets of N typed columns
● Three key factors help understand the subject
○ Consistency: Do you get identical results, regardless of which node is queried?
○ Availability: Can it respond to very high read and write volumes?
○ Partition tolerance: Is it still available when part of it is down?

What is the CAP theorem?
● In distributed systems, consistency, availability and partition tolerance exist in a mutually dependent relationship - pick any two.

[CAP triangle diagram]
● CA (Consistency + Availability): RDBMS - MySQL, PostgreSQL, Greenplum, Vertica; Graph - Neo4J
● AP (Availability + Partition tolerance): Key-Value / Wide Column - Cassandra, DynamoDB, Riak, CouchDB, Voldemort
● CP (Consistency + Partition tolerance): HBase, MongoDB, Redis, BigTable, BerkeleyDB

DB Engines - Comparison
● http://db-engines.com/en/ranking


What does DevOps really mean?

Development: Software Engineering, UX

Operations: System Admin, Database Admin

What does DevOps really mean?

DevOps: Cross-functional teams
● Operators automating systems
● Developers operating systems

Introduction to Big Data Frameworks

https://d152j5tfobgaot.cloudfront.net/wp-content/uploads/2015/02/yourstory_BigData.jpg

Characteristics of Hadoop

● A system to process very large amounts of unstructured and complex data at the desired speed

● A system that runs on a large number of machines that don’t share any memory or disk

● A system that runs on a cluster of machines which can be put together at relatively low cost and maintained easily

Hadoop Principles
● “A system that moves the computation to where the data is”

● Key Concepts of Hadoop

○ Flexibility
○ Scalability
○ Low cost
○ Fault tolerance

Hadoop Core Components

● HDFS - Hadoop Distributed File System
○ Provides a distributed data storage system that stores data in smaller blocks in a fail-safe manner
● MapReduce - Programming framework
○ Takes a query over a dataset, divides it, and runs it in parallel on multiple nodes (see the sketch after this list)
● YARN - (Yet Another Resource Negotiator) MRv2
○ Splits the MapReduce JobTracker’s responsibilities into:
■ Resource Manager (global)
■ Application Master (per application)
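To make the MapReduce idea concrete, here is a minimal word-count sketch in the Hadoop Streaming style; the script name and the local shell pipeline below are assumptions for illustration, not part of the original deck.

```python
#!/usr/bin/env python3
"""Word count in the Hadoop Streaming style (illustrative sketch)."""
import sys


def mapper(lines):
    # Map phase: emit one tab-separated (word, 1) pair per word.
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"


def reducer(lines):
    # Reduce phase: input arrives sorted by key, so equal words are adjacent.
    current, count = None, 0
    for line in lines:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"


if __name__ == "__main__":
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    for record in (mapper if stage == "map" else reducer)(sys.stdin):
        print(record)
```

Locally this can be exercised as `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`; Hadoop Streaming applies the same map → shuffle/sort → reduce pattern, but moves the mapper to the nodes holding the data blocks.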

Hadoop Ecosystem

Hadoop Core: HDFS, MapReduce / YARN, Hadoop Common

Hadoop Applications: Hive, Pig, HBase, Oozie, ZooKeeper, Sqoop, Spark

Hadoop (+Spark) Distributions

AWS Elastic MapReduce (EMR), Google Cloud Dataproc

New Age BI Applications

● Able to understand various types of data
● Ability to clean the data
● Process data with applied rules, locally and in a distributed environment
● Visualize sizeable data with speed
● Extend results by sharing within the enterprise

Big Data Analytics

● Processing large amounts of data without data movement
● Avoid data connectors if possible (run natively)
● Ability to understand a vast number of data types and data compressions
● Ability to process data on a variety of processing frameworks
● Distributed data processing
○ In-Memory is a big plus
● Super fast visualization
○ In-Memory is a big plus

When to choose Hadoop?

● Large volumes of data to store and process
● Semi-structured or unstructured data
● Data is not well categorized
● Data contains a lot of redundancy
● Data arrives in streams or large batches
● Complex batch jobs arriving in parallel
● You don’t know how the data might be useful

Distributed Systems => Problems

https://imgflip.com/i/1ap5kr
http://kingofwallpapers.com/otter/otter-004.jpg

Monolith Structure

[Diagram: a single stack - OS, CPU, Memory, Disk; processes: Java Application Server, Database, Web Server, Load Balancer, Monitoring System, UI; serving users and other applications]

Many times...all of this was on a single physical server!

Distributed Microservices Architecture

[Diagram: Service A, Service B and Service C, each with its own DB and cache, connected through a queue; a Web Server with its DB; an Analytics Cluster (Master + Slaves); and a big question mark over the Monitoring System???]

MongoDB + Spark

[Diagram: a Spark Cluster (Master + Worker 1 … Worker N) writing to and reading from a sharded MongoDB (replica sets)]

Cassandra + Spark

[Diagram: a Spark Cluster (Worker 1 … Worker N) writing to and reading from a Cassandra Cluster - see the code sketch below]
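A rough sketch of the read/write path in the diagram above, assuming the DataStax spark-cassandra-connector is used from PySpark; the connector version, host, keyspace and table names are placeholders, not part of the original deck.

```python
"""Sketch: reading from / writing to Cassandra from a Spark job."""
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("cassandra-spark-sketch")
    # Assumed connector coordinates; match the version to your Spark/Scala build.
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
    .config("spark.cassandra.connection.host", "cassandra-node-1")  # placeholder host
    .getOrCreate()
)

# Read a Cassandra table into a DataFrame (placeholder keyspace/table).
events = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="events_ks", table="raw_events")
    .load()
)

# A small aggregation that runs on the Spark cluster, not inside Cassandra.
daily = events.groupBy("event_date").agg(F.count("*").alias("events"))

# Write the result back to another (placeholder) Cassandra table.
(
    daily.write.format("org.apache.spark.sql.cassandra")
    .options(keyspace="events_ks", table="daily_counts")
    .mode("append")
    .save()
)
```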

Cassandra + Serving

[Diagram: multiple Web Services writing to and reading from the Cassandra Cluster, serving multiple UI Clients]

Problems

● Multiple physical servers

● Multiple logical services

● Want Scaling => More Servers

● Even if you had all of the metrics
○ You’ll have an overflow of data

● Your monitoring becomes a “Big Data” problem itself

This is what “Distributed” really means

The DevOps Guy

(It might be you)

Monitoring is Crucial

http://memeguy.com/photo/46871/you-are-being-monitored

Monitoring Operating System Metrics

Some help from “the Cloud”

AWS’s CloudWatch / GCP StackDriver

Report to Where?

● We chose: Graphite (InfluxDB) + Grafana
● Can correlate system and application metrics in one place :)
(see the metric-push sketch below)
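A minimal sketch of pushing an application metric into Graphite over Carbon’s plaintext protocol, so it can sit next to the system metrics in Grafana; the host and metric path below are placeholders.

```python
"""Sketch: pushing an application metric to Graphite (illustrative)."""
import socket
import time

CARBON_HOST = "graphite.internal"  # placeholder host
CARBON_PORT = 2003                 # Carbon plaintext listener default


def send_metric(path, value, timestamp=None):
    # Carbon's plaintext protocol: "<metric.path> <value> <unix-timestamp>\n"
    timestamp = int(timestamp if timestamp is not None else time.time())
    line = "{} {} {}\n".format(path, value, timestamp)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))


if __name__ == "__main__":
    # An application-level metric (placeholder name) next to the system metrics.
    send_metric("myapp.ingest.docs_processed", 1234)
```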

Monitoring Cassandra


Monitoring Spark

Ways to Monitor Spark

● Grafana-spark-dashboards

○ Blog: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/

● Spark UI - online, on each running application
● Spark History Server - offline (after the application finishes)
● Spark REST API
○ Querying via in-house tools to do ad-hoc monitoring (see the sketch below)
● Back to the basics: dstat, iostat, iotop, jstack
● Blog post by Tzach Zohar - “Tips from the Trenches”
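As an example of the ad-hoc REST API querying mentioned above, a small sketch against Spark’s monitoring API (served by the driver UI on port 4040, or by the History Server for finished applications); the host below is a placeholder.

```python
"""Sketch: ad-hoc monitoring via Spark's REST API."""
import requests

BASE = "http://spark-driver.internal:4040/api/v1"  # placeholder driver host

# List running applications, then count active stages per application.
apps = requests.get(f"{BASE}/applications", timeout=10).json()
for app in apps:
    app_id = app["id"]
    stages = requests.get(f"{BASE}/applications/{app_id}/stages", timeout=10).json()
    active = [s for s in stages if s.get("status") == "ACTIVE"]
    print(f"{app['name']} ({app_id}): {len(active)} active stage(s)")
```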

Monitoring Your Data

https://memegenerator.net/instance/53617544

Data Questions - What should we measure?

● Did all of the computation occur?

○ Are there any data layers missing?

● How much data do we have? (Volume)

● Is all of the data in the Database?

● Data Quality Assurance

Data Answers!
● The method doesn’t really matter, as long as you:
○ Can follow the results over time (see the count-check sketch after this list)
○ Know your data flow and know what might fail
○ Make it easy for anyone to add more monitoring (for the ones adding the new data each time…)
○ Don’t trust others to add monitoring
(It will always end up as the DevOps’s “fault” -> no monitoring will be applied)
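A hedged sketch of turning the “did all of the computation occur / how much data do we have” questions into a daily check; count_documents_for() is a hypothetical stand-in for whatever query your datastore needs, and the threshold and metric are placeholders.

```python
"""Sketch: a daily data-completeness check (illustrative)."""
import datetime
import logging

EXPECTED_MIN_DOCS = 100_000  # placeholder threshold per day


def count_documents_for(day):
    """Hypothetical helper: run a count query against your datastore for that day."""
    raise NotImplementedError


def check_yesterday():
    day = datetime.date.today() - datetime.timedelta(days=1)
    actual = count_documents_for(day)
    # Push the count as a metric (e.g. via send_metric from the Graphite sketch)
    # so the trend is visible over time, then alert on the threshold.
    if actual < EXPECTED_MIN_DOCS:
        logging.error("Data layer incomplete for %s: %d < %d",
                      day, actual, EXPECTED_MIN_DOCS)
    return actual
```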

ELK - Elasticsearch + Logstash + Kibana

http://www.digitalgov.gov/2014/05/07/analyzing-search-data-in-real-time-to-drive-decisions/
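To show the shape of the data flowing through ELK, a minimal sketch that indexes one structured log event straight into Elasticsearch via its REST API; in a real setup Logstash or a Beats shipper usually does this, and the URL, index name and fields here are placeholders.

```python
"""Sketch: indexing a structured log event into Elasticsearch (illustrative)."""
import datetime
import requests

ES_URL = "http://elasticsearch.internal:9200"  # placeholder URL
INDEX = "app-logs"                             # placeholder index name

event = {
    "@timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    "level": "ERROR",
    "service": "ingest-worker",
    "message": "failed to parse document",
}

# POST /<index>/_doc creates a document with an auto-generated id.
resp = requests.post(f"{ES_URL}/{INDEX}/_doc", json=event, timeout=10)
resp.raise_for_status()
```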

Monitoring Stack

● Alerting
● Metrics Collection
● Datastore
● Dashboard
● Data Monitoring
● Log Monitoring

Big Data - Are we there yet?

● “3 Vs” - What are the right questions we want to ask?

○ Volume - How much?

■ Can it run on a single machine in reasonable time?

○ Velocity - How fast?

■ Can a single machine handle the throughput?

○ Variety - What kind? (Difference)

■ Is your data uniform, not really changing or varying?

● If the answer to most of the questions above is “Yes” - think again before adding the complexity of “Big Data”

Conclusions

● Think carefully before going into the “Big Data pool”
○ See if you really have a problem that you’re trying to solve
○ It’s not a silver bullet

● Take measures to automate and monitor everything
● Having clusters and distributed frameworks will cost a lot - eventually
● Fit your storage layer(s) to the needs

Questions?

https://www.stayathomemum.com.au/wp-content/uploads/2015/01/DDDDDD.jpg

Still feel like you’re drowning?