47
Apache Ignite as MPP Accelerator Alexander Ermakov, CTO

IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

  • Upload
    others

  • View
    16

  • Download
    0

Embed Size (px)

Citation preview

Page 1: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Apache Ignite as MPP Accelerator

Alexander Ermakov, CTO

Page 2: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

• About us

• Why do traditional DWH needs in-memory grid?

• Real Time Analytics for Telco Cases

• Integrating Apache Ignite with Arenadata DB

• Using the power of in-memory computing with MPP (Example)

Agenda

Page 3: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

<About us>

Page 4: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

• Arenadata unites a keen team of developers & engineers working on building enterprise data platform.

• We are contributors of Open Source Projects:

• Greenplum

• Apache PXF

• Apache Bigtop

• Members of ODPi (Linux Foundation) since 2015

Who we are?

Page 5: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

ODPi Compliant Platforms

Page 6: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Arenadata Enterprise Data Platform

Plat

form

Ext

ensi

on

Fram

ewor

k

Page 7: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Arenadata - Open Source

store.arenadata.io

Page 8: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Our Partners

Page 9: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Why DWH needs in-memory grid?

Page 10: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Gene Sequencing

Smart Grids

COST TO SEQUENCE

ONE GENOMEHAS FALLEN FROM $100M

IN 2001 TO $10K IN 2011TO $1K IN 2014

READING SMART METERSEVERY 15 MINUTES IS3000X MORE

DATA INTENSIVE

Stock Market

Social Media

FACEBOOK UPLOADS250 MILLION

PHOTOS EACH DAY

Oil Exploration

Video Surveillance

OIL RIGS GENERATE

25000DATA POINTS PER

SECOND

Medical Imaging

Mobile Sensors

New Generation of Business Cases

Page 11: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Data Value Chain

ms seconds hours weeks months year years+

Page 12: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Data Warehouse

BI

SP

Table

OLTP

OD

S DD

S

Dat

a M

art

DWH

ESAPI

ELT & DQ

Batch

CDC

Sources Transport Store AnalyzeTransform

Page 13: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Data Lake

OLTP

SP

Table

ELT & DQ

OD

S DD

S

Dat

a M

art

DWH

Batch

ESAPI

CDC…

API

Queue

Hadoop

HDFSSQLOn

Hadoop

Sources Transport Store AnalyzeTransform

BI

Page 14: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Lambda Architecture

ELT & DQ

OD

S DD

S

Dat

a M

art

DWH

Batch

STG

BatchApp

Hadoop

Real Time

App

HDFSSQLOn

Hadoop

ESAPI

CDC

Sources Transport Store AnalyzeTransform

BI

Queue

Page 15: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Kappa Architecture

STG

BatchApp

Real Time

App

Sources Transport Store AnalyzeTransform

BI

Queue

Page 16: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Real Time Analytics for Telco Cases

Page 17: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Customer Retention / Connection Breakdowns

Page 18: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Geo Marketing

Page 19: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Migrating from a Reactive, Static and Constrained Model…

HDFS

Data Lake

Ingest Store Analytics

Hard to changeLabor intensive

Inefficient

Coding basedNo real-time informationBased on expensive ETL

Page 20: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

To Pro-Active, Self-Improving, Machine Learning Systems

HDFSData Lake Expert System / Machine Learning

In-Memory Real-Time Data

Continuous LearningContinuous Improvement

Continuous Adapting

Data Stream Pipeline

Multiple Data SourcesReal-Time ProcessingStore Everything

Page 21: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Sandboxes

Data FeedsStream Processing

Expert Systems Machine Learning

Historical Data

Business ValueSmart Decisions

HDFS

Data Lake

Page 22: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Data Streaming Reference ArchitectureData Feeds Transactional Apps Analytic Apps

Data Stream Pipeline

Real Time Data & Distributed Computing

Expert Systems & Machine Learning

AdvancedAnalytics

Data Lake

Page 23: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Data Streaming Reference ArchitectureData Feeds Transactional Apps Analytic Apps

Data Lake

Page 24: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Integrate Apache Ignite with Arenadata DB

Page 25: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Arenadata Grid

Page 26: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Arenadata Grid Use Cases

Page 27: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

SegmentSegment Segment Segment Segment…

Segment Host with one or more Segment InstancesSegment Instances process queries in parallel

Flexible framework for processing large datasets

High speed interconnect for continuous pipelining of data processing

Master Host and Standby Master HostMaster coordinates work with Segment Hosts

Segment Hosts have their own CPU, disk and memory (shared nothing)

Arenadata DB Architecture

StandbyMaster

MasterHost

SQL

Page 28: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Greenpum Core Development• Zstandard support (will be added to stable at 6.0.0 due to naming convention)

• PXF development: we bet a lot. Ignite integration, push down feature, JDBC & Ignite stable release• Few bugs and a lot of issues

Page 29: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

29

Parallel Query Optimizer• Cost-based optimization looks for

the most efficient plan• Physical plan contains scans, joins,

sorts, aggregations, etc.• Global planning avoids sub-optimal

‘SQL pushing’ to segments• Directly inserts ‘motion’

nodes for inter-segment communication

PHYSICAL EXECUTION PLANFROM SQL OR MAPREDUCE

Gather Motion 4:1(Slice 3)

Sort

HashAggregate

HashJoin

Redistribute Motion 4:4(Slice 1)

HashJoin

Hash Hash

HashJoin

Hash

Broadcast Motion 4:4(Slice 2)

Seq Scan on motion

Seq Scan on customerSeq Scan on line item

Seq Scan on orders

Page 30: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

30

MADlib: Toolkit for Advanced Big Data Analytics

• Better Parallelism– Algorithms designed to leverage MPP or Hadoop

architecture• Better Scalability

– Algorithms scale as your data set scales– No data movement

• Better Predictive Accuracy– Using all data, not a sample, may improve accuracy

• Open Source – Available for customization and optimization by user

Page 31: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

31

MADlib In-DatabaseFunctions

Predictive Modeling Library

Linear Systems• Sparse and Dense Solvers

Matrix Factorization• Singular Value Decomposition (SVD)

Generalized Linear Models• Linear Regression• Logistic Regression• Multinomial Logistic Regression• Cox Proportional Hazards• Regression• Elastic Net Regularization• Sandwich Estimators (Huber white,

clustered, marginal effects)

Machine Learning Algorithms• ARIMA• Principal Component Analysis (PCA)• Association Rules (Affinity Analysis, Market

Basket)• Topic Modeling (Parallel LDA)• Decision Trees• Ensemble Learners (Random Forests)• Support Vector Machines• Conditional Random Field (CRF)• Clustering (K-means) • Cross Validation

Descriptive Statistics

Sketch-based Estimators• CountMin (Cormode-

Muthukrishnan)• FM (Flajolet-Martin)• MFV (Most Frequent

Values)CorrelationSummary

Support Modules

Array OperationsSparse VectorsRandom SamplingProbability Functions

Page 32: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Polymorphic Table Storage

Historical data(Years)

slow HDD

Actual data(months)

regular HDDNow data(hours)SSD

Single table

• Provide the choice of processing model for any table or any individual partition

– Enable Information Lifecycle Management (ILM)

• Storage types can be mixed within a table or database

– Four table types: heap, row-oriented AO, column-oriented, external

– Block compression: Gzip (levels 1-9), Zstd– Columnar compression: RLE

Page 33: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Platform eXtension Framework (PXF)

• An advanced version of Greenplum external tables

• Supports connectors for HDFS, HBase and Hive, JDBC, Ignite (Arenadata DB)

• Provides extensible framework API to enable custom connector

Page 34: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

PXF Profiles• HDFS Files• Ignite• JDBC• Avro• HBase• Hive

– Text based– SequenceFile– RCFile– ORCFile

CREATE EXTERNAL TABLE pxf_sales_part(item_name TEXT, item_type TEXT, supplier_key INTEGER, item_price DOUBLE PRECISION, delivery_state TEXT, delivery_city TEXT

)LOCATION (‘pxf://grid_host?Profile=Ingite&IGNITE_CACHE=test&BUFFER_SIZE=10000’);

Page 35: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

PXF Profiles

<profile>

<name>Ignite</name>

<plugins>

<fragmenter>IgniteFragmenter</fragmenter>

<accessor>IgniteAccessor</accessor>

<resolver>IgniteResolver</resolver>

<analyzer>IgniteAnalyzer</analyzer>

</plugins>

</profile>

Page 36: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

• Fragmenter– returnsalistofsourcedatafragmentsandtheirlocation

• Accessor– accessagivenlistoffragmentsreadthemandreturnrecords

• Resolver– deserializeeachrecordaccordingtoagivenschemaortechnique

• Analyzer– returnsstatisticsaboutthesourcedata

PXF Classes

Page 37: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Date User_id Message

21-01-2018 16 <message>

01-11-2018 500 <message>

15-05-2018 2042 <message>

17-09-2017 15 <message>

15-06-2016 55 <message>

24-12-2015 3510 <message>

01-01-2012 19 <message>

26-04-2013 42 <message>

23-05-2010 17 <message>

GridexternaltableLatency:millisecondsCostperGB:$$$

HadoopexternaltableLatency:tensofsecondsCostperGB:$

RegularADBtableLatency:secondsCostperGB:$$

…whereDate>20-01-2018

…partitionbyDate(partition1:Date=>01-01-2018partition2:Date<01-01-2018andDate=>01-01-2015partition3:Date<01-01-2015)

…whereDate<18-09-2017

…whereDate>16-06-2017ANDUser_id<400

Pushdown

Partitionfilter

PushdownfilterExecutedinexternalsystem

PXF Pushdown Feature

Page 38: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

PXF Pushdown Feature

Page 39: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Using power of In-Memory computing with MPP

Page 40: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Test Bench

Ignite1 Ignite2

GreenplumMaster

GreenplumSeg2

GreenplumSeg1

HadoopNamenode

HadoopDatanode2

HadoopDatanode1

Internal Affinity Functions

Arenadata Unified Data Platform

PXF interaction

Page 41: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Creating Table in MPP

Page 42: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Creating External Table for Apache Ignite & Load Data

Page 43: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Creating External Table in Hive & Load Data

Page 44: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Exchange Partitions with External Tables

Page 45: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Target Table

Page 46: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Execution Plan

prt2: Greenplum Heap Partition

prt1: Ignite Cache Partition

Page 47: IMCS Apache Ignite as MPP Accelerator Ignite as MPP Accelerator Alexander Ermakov, CTO • About us • Why do traditional DWH needs in-memory grid? • Real Time Analytics for Telco

Thank you!

Questions?