
Big Data Visualization using Apache Spark and Zeppelin

Prajod Vettiyattil, Software Architect, Wipro

Agenda

• Big Data and Ecosystem tools

• Apache Spark

• Apache Zeppelin

• Data Visualization

• Combining Spark and Zeppelin


BIG DATA AND ECOSYSTEM TOOLS

Big Data

• Data size beyond a single system's capability

– Terabyte, Petabyte, Exabyte

• Storage

– Commodity servers, RAID, SAN

• Processing

– In reasonable response time

– The real challenge at this scale


Traditional processing tools

• Move what?

– the data to the code or

– the code to the data


[Diagram: two options, moving the data to the code vs. moving the code to the data]

Traditional processing tools

• Traditional tools

– RDBMS, DWH, BI

– High cost

– Difficult to scale beyond a certain data size

• price/performance skew

• data variety not supported


Map-Reduce and NoSQL

• Hadoop toolset

– Free and open source

– Commodity hardware

– Scales to exabytes (10^18 bytes), maybe even more

• Not only SQL

– Storage and query processing only

– Complements Hadoop toolset

– Volume, velocity and variety


All is well?

• Hadoop was designed for batch processing

• Disk based processing: slow

• Many tools to enhance Hadoop’s capabilities

– Distributed cache, HaLoop, Hive, HBase

• Not suited for interactive and iterative workloads


TOWARDS SINGULARITY

What is singularity?


[Chart: decade vs. AI capacity, rising sharply toward the point of singularity]

Technological singularity

• When AI capability exceeds human capacity

• AI or non-AI singularity

• 2045: the predicted year (http://en.wikipedia.org/wiki/Ray_Kurzweil)


APACHE SPARK

History of Spark


• 2009: Spark created by Matei Zaharia, a PhD student at UC Berkeley

• 2010: Spark made open source; available on GitHub

• 2013: Spark donated to the Apache Software Foundation

• 2014: Spark 1.0.0 released; 100 TB sort achieved in 23 minutes

• 2015 (March): Spark 1.3.0 released

Contributors in Spark

• Yahoo

• Intel

• UC Berkeley

• …

• 50+ organizations


Hadoop and Spark

• Spark complements the Hadoop ecosystem

• Replaces Hadoop MapReduce

• Spark integrates with

– HDFS

– Hive

– HBase

– YARN


Other big data tools

• Spark also integrates with

– Kafka

– ZeroMQ

– Cassandra

– Mesos


Programming Spark

• Java

• Scala

• Python

• R

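To make this concrete, here is a minimal word-count sketch in Scala against the Spark 1.x API (the HDFS path and app name are illustrative assumptions, not from the deck):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // Read a file from HDFS (hypothetical path), split it into words,
    // and count occurrences per word
    val counts = sc.textFile("hdfs://namenode:8020/data/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)  // action: triggers the computation
    sc.stop()
  }
}

The same program can be written in Java, Python, or R; Scala is used for the sketches in this deck.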

Spark toolset


[Diagram: the Apache Spark stack, including Spark SQL, Spark Streaming, MLlib, GraphX, SparkR, BlinkDB, the Spark Cassandra connector, and Tachyon]

What is Spark for?


• Batch

• Interactive

• Streaming

The main difference: speed

• RAM access vs disk access

– RAM access is 100,000 times faster!


https://gist.github.com/hellerbarde/2843375
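That gap is the motivation for Spark's in-memory caching. A minimal sketch, assuming the sc provided by the Spark shell and a hypothetical log path:

val logs = sc.textFile("hdfs://namenode:8020/logs")    // hypothetical path
val errors = logs.filter(_.contains("ERROR")).cache()  // keep in RAM after first use

errors.count()                                // first action: reads from disk, fills the cache
errors.filter(_.contains("timeout")).count()  // later actions are served from memory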

Lambda Architecture pattern

• Spark is used to implement the Lambda architecture (see the sketch below)

– Batch layer

– Speed layer

– Serving layer


[Diagram: data input flows into the batch layer and the speed layer; both feed the serving layer, which answers queries from data consumers]
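A rough sketch of how Spark could cover the two computation layers (the paths, host, and port are illustrative assumptions; the serving layer is typically a separate store):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch layer: periodically recompute a complete view of the master dataset
val batchView = sc.textFile("hdfs://namenode:8020/master-data")
  .map(line => (line.split(",")(0), 1))
  .reduceByKey(_ + _)

// Speed layer: incremental view over recent events, in 10-second batches
val ssc = new StreamingContext(sc, Seconds(10))
val speedView = ssc.socketTextStream("localhost", 9999)
  .map(line => (line.split(",")(0), 1))
  .reduceByKey(_ + _)

speedView.print()  // the serving layer would merge batch and speed views
ssc.start()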

Deployment Architecture

[Diagram: the Spark Driver connects to Spark's Cluster Manager on the master node; each worker node runs executors that hold tasks and a cache, co-located with HDFS data nodes; the HDFS name node coordinates the data nodes]

APACHE ZEPPELIN

Interactive data analytics

• For Spark and Flink

• Web front end

• On the back end, it connects to

– SQL systems (e.g. Hive)

– Spark

– Flink


Deployment Architecture


[Diagram: web browsers connect to the Zeppelin daemon's web server; the daemon runs local interpreters and, optionally, remote interpreters, which connect to Spark, Flink, or Hive]

Notebook

• Where you do your data analysis

• Web UI REPL with pluggable interpreters

• Interpreters

– Scala, Python, Angular, SparkSQL, Markdown and Shell


User Interface features

• Markdown

• Dynamic HTML generation

• Dynamic chart generation

• Screen sharing via websockets

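The dynamic HTML and charts build on Zeppelin's display system: interpreter output that starts with %html or %table is rendered rather than printed. A small sketch from a Scala paragraph (the values are made up):

// Rendered as HTML in the notebook
println("%html <h3>Report generated at " + new java.util.Date() + "</h3>")

// Rendered as a table (tab-separated columns, newline-separated rows),
// which Zeppelin can also plot as a chart
println("%table age\tcount\n25\t10\n30\t17")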

SQL Interpreter

• SQL shell

– Query Spark data using SQL

– Returns results as plain text, HTML, or charts

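Before %sql can see any data, a table has to be registered from the Spark side. A sketch of how the bank table queried on the later slides might be prepared (the column layout and path are assumptions):

import sqlContext.implicits._  // sqlContext is provided by Zeppelin's Spark interpreter

case class Bank(age: Int, marital: String)

val bank = sc.textFile("hdfs://namenode:8020/bank.csv")  // hypothetical path
  .map(_.split(";"))
  .map(cols => Bank(cols(0).toInt, cols(2)))
  .toDF()

bank.registerTempTable("bank")  // now queryable from %sql paragraphs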

Scala interpreter for Spark

• Similar to the Spark shell

• Upload your data into Spark

• Query the datasets (RDDs) in your Spark server

• Execute map-reduce tasks

• Actions on RDDs

• Transformations on RDDs (see the sketch below)

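A short sketch of that execution model (assuming the sc provided by Zeppelin's Spark interpreter): transformations only describe the pipeline, actions run it.

val nums = sc.parallelize(1 to 1000000)

val squares = nums.map(n => n.toLong * n)  // transformation: nothing runs yet
val evens   = squares.filter(_ % 2 == 0)   // transformation: still lazy

evens.count()  // action: executes the pipeline on the cluster
evens.take(5)  // action: returns the first five results to the driver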

DATA VISUALIZATION

The need for visualization


[Diagram: big data is processed into a form the user can comprehend]

Tools for Data Presentation Architecture


A data analysis tool/toolset would support:

1. Identify

2. Locate

3. Manipulate

4. Format

5. Present

COMBINING SPARK AND ZEPPELIN

Spark and Zeppelin


[Diagram: web browsers connect to the Zeppelin daemon's web server; Zeppelin's local and remote interpreters connect to a Spark cluster of one master node and multiple worker nodes]

Zeppelin views: Table from SQL



%sql select age, count(1) from bank
where marital = "${marital=single,single|divorced|married}"
group by age order by age

Zeppelin views: Pie chart from SQL


Zeppelin views: Bar chart from SQL


Zeppelin views: Angular


Share variables: MVVM

• Between Scala/Python/Spark and Angular

• Observe Scala variables from Angular (see the sketch below)


[Diagram: Zeppelin propagates a variable x between the Scala/Spark side and the Angular side]
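A sketch of the Scala side, using the ZeppelinContext z that Zeppelin provides in Spark paragraphs (the log path is a made-up example):

// Compute a value in Scala/Spark...
val errorCount = sc.textFile("hdfs://namenode:8020/logs")
  .filter(_.contains("ERROR"))
  .count()

// ...and bind it so that Angular paragraphs can observe it
z.angularBind("errorCount", errorCount)

An %angular paragraph can then render it live, e.g. <div>Errors: {{errorCount}}</div>.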

Screen sharing using Zeppelin

• Share your graphical reports

– Live sharing

– Get the share URL from Zeppelin and share it with others

– Uses websockets

• Embed live reports in web pages


FUTURE

Spark and Zeppelin

• Spark

– Machine Learning using Spark

– GraphX and MLlib

• Zeppelin

– Additional interpreters

– Better graphics

– Report persistence

– More report templates

– Better Angular integration

SUMMARY

Summary

• Spark and tools

• The need for visualization

• The role of Zeppelin

• Zeppelin – Spark integration

