Big Data Analysis in Java World by Serhiy Masyutin

JEEConf 2015 Big Data Analysis in Java World


Agenda

The Big Data Problem
Map-Reduce
MPP Analytical Database
In-Memory Data Fabric
Real-Life Project
Q&A

The Big Data Problem

http://www.datameer.com/images/product/big_data_hadoop/img_bigdata.png

- Doug Laney

The Big Data Problem

When do I need it?
Map-Reduce: In an hour
MPP AD: In a minute
IMDF: Now

What do I need to do with it?
Map-Reduce: Exploratory analytics
MPP AD: Structured analytics
IMDF: Singular event processing (some analytics), Transactions

How will I query and search?
Map-Reduce: Unstructured
MPP AD: Ad hoc SQL
IMDF: Structured

How do I need to store it?
Map-Reduce: I do, but not required to
MPP AD: I must and I am required to
IMDF: Temporarily

Where is it coming from?
Map-Reduce: File/ETL
MPP AD: File/ETL
IMDF: Event/Stream/File/ETL

http://blog.pivotal.io/pivotal/products/exploring-big-data-solutions-when-to-use-hadoop-vs-in-memory-vs-mpp

The Big Data Problem

Map-Reduce

MPP AD IMDF

Transactions

Customer records

Geo-spatial

Sensors

Social Media

XML, JSON

Raw Logs

Text

Image

Video

more processing

http://blog.pivotal.io/big-data-pivotal/products/exploratory-data-science-when-to-use-an-mpp-database-sql-on-hadoop-or-map-reduce

The Big Data Problem

Data is not Information

- Clifford Stoll

Map-Reduce

http://jeremykun.files.wordpress.com/2014/10/mapreduceimage.gif?w=1800

Map-Reduce

https://anonymousbi.files.wordpress.com/2012/11/hadoopdiagram.png

Map-Reduce

http://hadoop.apache.org/docs/r1.2.1/images/hdfsarchitecture.gif


Map-Reduce

Volume: Medium-Large
Variety: Unstructured data
Velocity: Batch processing
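The map, shuffle, and reduce phases in the diagrams above can be sketched in plain Java. This is a single-JVM illustration using word counting as the example job, not Hadoop code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Single-JVM illustration of the map, shuffle, and reduce phases (not Hadoop code). */
final class WordCountSketch {

    /** Map phase: emit a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(word -> !word.isEmpty())
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());
    }

    /** Reduce phase: sum all counts that were emitted for one word. */
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    static Map<String, Integer> run(List<String> lines) {
        // Shuffle: group emitted values by key, as a framework would do between the phases.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), key -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not to be", "To do")));
    }
}
```

In a real cluster the map calls run in parallel next to the data blocks and the shuffle moves data between machines, which is where the batch-style latency comes from.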

MPP Analytical Database

http://www.ndm.net/datawarehouse/images/stories/greenplum/gp-dia-3-0.png

MPP Analytical Database

http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagram.png

MPP Analytical Database

http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagramOneNodeDown.png

MPP Analytical Database

http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagramTwoNodesDown.png

MPP Analytical Database

http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/DataK-Safety-K2Nodes2And3Failed.png
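The K-safety diagrams referenced above show a cluster that keeps serving data after node failures. A minimal sketch of the idea, assuming a simplified model in which each node's segment is also copied to the next K nodes in a ring; the class and method names are hypothetical, and a real cluster may apply additional rules (for example a node quorum) that this model does not cover.

```java
import java.util.Set;

/** Simplified model of K-safety: each segment is also copied to the next K nodes in a ring. */
final class KSafetySketch {

    /**
     * Returns true if every segment still has at least one live copy.
     * Segment i is held by nodes i, i+1, ..., i+K (modulo the cluster size).
     */
    static boolean allDataAvailable(int nodeCount, int k, Set<Integer> failedNodes) {
        for (int segment = 0; segment < nodeCount; segment++) {
            boolean hasLiveCopy = false;
            for (int offset = 0; offset <= k; offset++) {
                int holder = (segment + offset) % nodeCount;
                if (!failedNodes.contains(holder)) {
                    hasLiveCopy = true;
                    break;
                }
            }
            if (!hasLiveCopy) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Five nodes, numbered 0..4.
        System.out.println("K=1, node 2 down:        " + allDataAvailable(5, 1, Set.of(2)));
        System.out.println("K=1, nodes 1 and 3 down: " + allDataAvailable(5, 1, Set.of(1, 3)));
        System.out.println("K=1, nodes 2 and 3 down: " + allDataAvailable(5, 1, Set.of(2, 3)));
        System.out.println("K=2, nodes 2 and 3 down: " + allDataAvailable(5, 2, Set.of(2, 3)));
    }
}
```

In this model, K=1 survives any single failure and some two-node failures, while losing two adjacent nodes needs K=2.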

MPP Analytical Database

JDBC

http://www.ndm.net/datawarehouse/images/stories/greenplum/gp-dia-3-0.png
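From Java, an MPP analytical database is reached through plain JDBC, as the slide notes. A sketch using only `java.sql`; the table and column names (`sensor_readings`, `device_id`, `reading_time`, `reading_value`) and the JDBC URL passed on the command line are assumptions, and a vendor driver has to be on the classpath for a real connection.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

/** Ad hoc SQL over plain JDBC; table and column names are hypothetical. */
final class JdbcQuerySketch {

    static final String AVERAGE_PER_DEVICE_SQL =
            "SELECT device_id, AVG(reading_value) AS avg_value "
          + "FROM sensor_readings "
          + "WHERE reading_time >= ? AND reading_time < ? "
          + "GROUP BY device_id";

    /** Runs the aggregate inside the database and prints one line per device. */
    static void printAverages(Connection connection, Instant from, Instant to) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(AVERAGE_PER_DEVICE_SQL)) {
            statement.setTimestamp(1, Timestamp.from(from));
            statement.setTimestamp(2, Timestamp.from(to));
            try (ResultSet rows = statement.executeQuery()) {
                while (rows.next()) {
                    System.out.println(rows.getString("device_id") + " " + rows.getDouble("avg_value"));
                }
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length == 0) {
            // No database configured: only show the statement that would be sent.
            System.out.println(AVERAGE_PER_DEVICE_SQL);
            return;
        }
        // args[0] is a JDBC URL for whichever driver is on the classpath.
        try (Connection connection = DriverManager.getConnection(args[0])) {
            Instant now = Instant.now();
            printAverages(connection, now.minusSeconds(3600), now);
        }
    }
}
```

The aggregation runs inside the database, so only the small result set travels back to the Java process.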

MPP Analytical Database

Volume: Small-Medium-Large
Variety: Structured data
Velocity: Interactive

Products named on the slide: Aster Database, Matrix

In-Memory Data Fabric

https://ignite.incubator.apache.org/images/in_memory_data.png


In-Memory Data Fabric

https://ignite.incubator.apache.org/images/in_memory_compute.png

In-Memory Data Fabric

http://hazelcast.com/wp-content/uploads/2013/12/IMDGEmbeddedMode_w1000px.png

In-Memory Data Fabric

Volume: Small-Medium
Variety: Structured data
Velocity: (Near) Real-Time
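The core idea behind an in-memory data grid, keeping data in RAM and partitioning it by key across nodes, can be sketched without any product API. This toy version keeps all "nodes" in one JVM; it is not the Ignite or Hazelcast API, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy in-memory key-value grid: keys are hash-partitioned across in-process "nodes". */
final class InMemoryGridSketch<K, V> {

    private final List<Map<K, V>> nodes = new ArrayList<>();

    InMemoryGridSketch(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new ConcurrentHashMap<>());
        }
    }

    /** The owning node is derived from the key alone, so any caller can locate the data. */
    int ownerOf(K key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    void put(K key, V value) {
        nodes.get(ownerOf(key)).put(key, value);
    }

    V get(K key) {
        return nodes.get(ownerOf(key)).get(key);
    }

    int entriesOn(int node) {
        return nodes.get(node).size();
    }

    public static void main(String[] args) {
        InMemoryGridSketch<String, Double> grid = new InMemoryGridSketch<>(3);
        grid.put("device-1", 21.5);
        grid.put("device-2", 19.0);
        System.out.println(grid.get("device-1"));
        System.out.println("device-1 is owned by node " + grid.ownerOf("device-1"));
    }
}
```

Real products add what this sketch leaves out: network transport, replication, rebalancing when nodes join or leave, and running computations on the node that owns the data.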

Real-Life Project

Sensor data
The number of devices currently doubles every year
Data flow: ~200GB/month
Target data flow: ~500GB/month
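A back-of-the-envelope check of these figures: if the data flow is assumed to grow in step with the device count (an assumption, the slide does not state it), doubling every year takes ~200GB/month to ~500GB/month in log2(500/200) years, roughly 1.3 years. The helper below is hypothetical.

```java
import java.util.Locale;

/** Rough estimate; assumes data flow grows in step with the device count. */
final class GrowthEstimate {

    /** Years until current reaches target if the value doubles every year. */
    static double yearsToReach(double current, double target) {
        return Math.log(target / current) / Math.log(2.0);
    }

    public static void main(String[] args) {
        double years = yearsToReach(200.0, 500.0); // GB/month figures from the slide
        System.out.println(String.format(Locale.ROOT,
                "%.2f years (about %.0f months)", years, years * 12));
    }
}
```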

Real-Life Project

Requirements

When do I need it? In a minute
What do I need to do with it? Structured analytics
How will I query and search? Ad hoc SQL
How do I need to store it? I must and I am required to
Where is it coming from? XML
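XML input combined with ad hoc SQL means each message has to be flattened into rows at some point. A minimal sketch with the JDK's DOM parser; the message layout (`reading` elements with `device`, `time` and `value` attributes) and the pipe delimiter are assumptions, not the project's actual format.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Flattens a hypothetical XML sensor message into delimited rows for a bulk load. */
final class XmlToRowsSketch {

    static List<String> toRows(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Device payloads are untrusted input, so DOCTYPE declarations are rejected.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            List<String> rows = new ArrayList<>();
            NodeList readings = document.getElementsByTagName("reading");
            for (int i = 0; i < readings.getLength(); i++) {
                Element reading = (Element) readings.item(i);
                rows.add(String.join("|",
                        reading.getAttribute("device"),
                        reading.getAttribute("time"),
                        reading.getAttribute("value")));
            }
            return rows;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not a parseable XML message", e);
        }
    }

    public static void main(String[] args) {
        String message = "<readings>"
                + "<reading device=\"d-1\" time=\"2015-01-01T00:00:00Z\" value=\"21.5\"/>"
                + "<reading device=\"d-2\" time=\"2015-01-01T00:00:00Z\" value=\"19.0\"/>"
                + "</readings>";
        toRows(message).forEach(System.out::println);
    }
}
```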

Real-Life Project

Time-series data
RESTful API
Extendable analytics
Scalability
Speed to Market

Real-Life Project

Architecture diagram, deployed across Availability Zones A, B and C.

Components: Devices, 3rd Party Services, Clients, UI, Collector, Raw message store, Processor, Recent data store, Permanent data store, Analytic Engine, Analytic Executor Pool, Analytics API, Client API

Real-Life Project

Pre-Processor


Real-Life Project

Post-Processor

Real-Life Project

Vertica stores time-series data only
Append-only data store
Store organizational data separately
Use Vertica’s ExternalFilter for data load
R analytics as UDFs on Vertica
Scale Vertica cluster accordingly

Real-Life Project

Choose the right tool for the job; late changes are expensive

You can do everything yourself. Should you?

Q&A