Emerging Data Architectures for Next Generation Workloads
MILIND BHANDARKAR, FOUNDER & CEO, AMPOOL INC.

Platform Track - Emerging Data Architecture @ ABDW17, Pune


About Me

• http://www.linkedin.com/in/milindb
• Founding member of Apache Hadoop team at Yahoo [2005-2010]
• HDFS, MapReduce, Pig, YARN, HAWQ, Geode…
• Chief Architect at Greenplum Labs (2011-2013)
• Chief Scientist at Pivotal Software (2013-2015)
• Founder, CEO Ampool (2015-)

Agenda

• History of Data Platforms

• Rise of RDBMS

• Hadoop & NoSQL

• Technology Trends

• Ampool : Active Data Store for Next Generation Workloads

Before 1975: CODASYL

• Committee On Data System Languages

• Development of Data Processing Languages (resulted in COBOL)

• 1967 – Database Technology Group (DBTG)

• Network Database Model

1975 - Today: System R & RDBMS

• 1970 – Relational Model of Data by E. F. Codd

• Projections, Selections, Joins, Difference, Union

• 1974-75 – IBM System R – An Experimental Database System based on Relational Algebra

• Structured Query Language - SQL
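The relational operators listed above can be illustrated with a small sketch. It models a relation as a list of dicts; the function names and the sample `employees` and `depts` data are assumptions made for illustration, not anything taken from System R.

```python
# Relations are modeled as lists of dicts (one dict per row).
# All names and sample data below are hypothetical.

def select(rel, pred):
    """Selection: keep the rows that satisfy a predicate."""
    return [row for row in rel if pred(row)]

def project(rel, cols):
    """Projection: keep only the named columns and drop duplicate rows."""
    seen, out = set(), []
    for row in rel:
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(cols, key)))
    return out

def natural_join(left, right):
    """Natural join: combine rows that agree on every shared column."""
    out = []
    for l in left:
        for r in right:
            if all(l[c] == r[c] for c in set(l) & set(r)):
                out.append({**l, **r})
    return out

def union(a, b):
    """Union: rows that appear in either relation, without duplicates."""
    out = []
    for row in a + b:
        if row not in out:
            out.append(row)
    return out

def difference(a, b):
    """Difference: rows of a that do not appear in b."""
    return [row for row in a if row not in b]

employees = [
    {"emp": "Ann", "dept": "Eng"},
    {"emp": "Bob", "dept": "Ops"},
    {"emp": "Cy", "dept": "Eng"},
]
depts = [
    {"dept": "Eng", "site": "Pune"},
    {"dept": "Ops", "site": "Mumbai"},
]

eng = select(employees, lambda r: r["dept"] == "Eng")
dept_names = project(employees, ["dept"])
joined = natural_join(employees, depts)
non_eng = difference(employees, eng)
everyone = union(eng, non_eng)
```

SQL expresses the same operations declaratively; for example, `joined` corresponds roughly to `SELECT * FROM employees NATURAL JOIN depts`.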


What Changed?


Big Data


Early 2000s: OLTP – OLAP Split

• OLTP/ODS
  • Transactional, Low Latency, Highly Concurrent, Scale Up
  • Highly Structured, Normalized Data Model
  • Customer facing, ~50-50 Reads/Updates
  • Oracle, MS SQL Server, IBM DB2, MySQL, Postgres
• OLAP/BI
  • Not transactional/long-running transactions, High throughput, Scale Out
  • Structured/Semi-structured, Denormalized data model
  • Business internal, ~90-10 Reads/Updates
  • Teradata, Netezza, Vertica, Greenplum, Aster Data
  • Mostly MPP

CAP Theorem

• Brewer’s Conjecture 1999

• Proved 2002 by Gilbert & Lynch

• It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
  • Consistency
  • Availability
  • Partition tolerance
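The trade-off can be made concrete with a toy model, assuming a single value replicated on two nodes. The class names and the "CP"/"AP" mode labels are assumptions for this sketch and do not describe any particular system.

```python
class Replica:
    """One copy of a single replicated value."""
    def __init__(self):
        self.value = None

class TwoNodeStore:
    """Toy two-replica register; mode is 'CP' or 'AP' (labels assumed here)."""
    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, node, value):
        if self.partitioned:
            if self.mode == "CP":
                # Prefer consistency: refuse the write, giving up availability.
                return False
            # Prefer availability: accept the write locally; replicas may diverge.
            node.value = value
            return True
        # No partition: update both replicas together.
        self.a.value = self.b.value = value
        return True

    def consistent(self):
        return self.a.value == self.b.value

cp = TwoNodeStore("CP")
cp.write(cp.a, 1)
cp.partitioned = True
cp_accepted = cp.write(cp.a, 2)   # rejected while partitioned

ap = TwoNodeStore("AP")
ap.write(ap.a, 1)
ap.partitioned = True
ap_accepted = ap.write(ap.a, 2)   # accepted, but node b still holds 1
```

During the partition the CP store stays consistent but rejects the write, while the AP store accepts the write and its replicas diverge until they are reconciled.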

2005-Today: NoSQL

• Amazon Dynamo, Google BigTable

• Key-Value Stores (e.g. Voldemort, Riak)

• Wide-Column, Column-Family Stores (e.g. HBase, Cassandra)

• Document Stores (e.g. MongoDB, Couchbase)

• Graph Stores (e.g. Neo4j)

• Time-Series Stores (Metric Stores, e.g. InfluxDB, OpenTSDB)

• Non-Relational? Non-Transactional? No SQL or “Not Only SQL” or “Not Yet SQL” ?

• Deconstruction of the Database
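As a rough illustration of how these store families differ, the sketch below shapes similar data for each model using plain Python structures. The keys, field names, and values are all hypothetical, and real stores add indexing, distribution, and persistence on top of these shapes.

```python
import json

# Key-value: an opaque value looked up by a single key.
kv = {"user:42": json.dumps({"name": "Asha", "city": "Pune"})}

# Wide-column / column-family: row key -> column family -> column -> value.
wide = {
    "user:42": {
        "profile": {"name": "Asha", "city": "Pune"},
        "activity": {"last_login": "2017-02-03"},
    }
}

# Document: nested, self-describing records that can be queried by field.
docs = [{"_id": 42, "name": "Asha", "address": {"city": "Pune"}, "tags": ["admin"]}]

# Graph: nodes plus labeled edges between them.
nodes = {42: {"name": "Asha"}, 7: {"name": "Ravi"}}
edges = [(42, "FOLLOWS", 7)]

# Time series: (metric, timestamp, value) points scanned by time range.
series = [("cpu.load", 1486080000, 0.42), ("cpu.load", 1486080060, 0.57)]

# The typical access pattern differs per model.
kv_city = json.loads(kv["user:42"])["city"]
wide_city = wide["user:42"]["profile"]["city"]
doc_hits = [d["_id"] for d in docs if d["address"]["city"] == "Pune"]
followed = [nodes[dst]["name"] for src, rel, dst in edges if src == 42 and rel == "FOLLOWS"]
recent = [v for metric, ts, v in series if metric == "cpu.load" and ts >= 1486080060]
```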


Source: https://www.cbinsights.com/blog/disrupting-banking-fintech-startups/ November 2015

Components of Database Systems

Hadoop Distributions – 2013

Hadoop Components – Today

Hadoop Distribution Versioning

Data Lakes

• First Gather All Data
  • Transactional Data from Business Applications
  • User Behavior Data from Interactive Applications
  • Social Media Data, Location Data
  • Public Datasets, Third-Party Data

• Then, Magic Happens … ?


Analytics:

Context = f(Data)


Time Value of Data

© Tibco Spotfire, InterOp, November 2013

Real-Time Context is more important than Historical Context


73% Planning, Implementing, or Expanding the use of Real-Time Data Platforms


Source: Forrester Global Business Technographics Data And Analytics Online Survey, 2015

Enterprises are hyper-personalizing Apps using advanced predictive Analytics driven by high-velocity Big Data

2/3/17 Ampool® Confidential


[Figure: Data, Users, Analytics, Apps, Multi-Device, Testing]

Enterprises Aspire To Build & Continuously Tune Real-time Intelligent, Data-driven Applications


Challenges

• Slow, complex data pipelines
• Real-time App pressure
• Operational complexity
• Data Silos & Disconnected Processing

Problems With Current Lambda Architecture

• Multiple Data Stores
  • One each for streaming, batch, queries, and applications cache
• Need to create multiple copies of data
  • Format conversions
  • Data Serialization & Deserialization at each stage
  • Data governance is distributed & complex
• Latency due to data propagation
  • Inhibits real-time insights


[Figure: Data Sources; Streaming Layer (Computations); Batch Layer (All Data, Computations); Query Layer (Query Engine); Application(s)]
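To see where the extra copies and the propagation delay come from, here is a minimal sketch of a Lambda-style event counter in plain Python. The class and method names are hypothetical; a real deployment would use separate systems for each layer, which is exactly the complexity the slide describes.

```python
from collections import Counter

class LambdaPipeline:
    """Toy Lambda architecture that counts events per key."""
    def __init__(self):
        self.master_log = []          # copy 1: immutable "all data" for the batch layer
        self.batch_view = Counter()   # copy 2: results precomputed by the batch layer
        self.speed_view = Counter()   # copy 3: incremental results for recent events

    def ingest(self, key):
        # Every event is written twice: to the batch log and to the speed layer.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # The batch layer recomputes its view from scratch over all data,
        # after which the speed layer's recent results can be discarded.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        # The query layer must merge both views to get a complete answer.
        return self.batch_view[key] + self.speed_view[key]

pipeline = LambdaPipeline()
for ad in ["ad1", "ad1", "ad2"]:
    pipeline.ingest(ad)
before_batch = pipeline.query("ad1")
pipeline.run_batch()
pipeline.ingest("ad1")
after_batch = pipeline.query("ad1")
```

Even in this toy version the same data lives in three structures, and a correct answer depends on merging views that are refreshed at different times.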

Problem: Complex, Slow Data Processing
Solution: Memory-centric Active Data Store


Eliminate

• File-based data exchange
• File format Conversions
• Data Copies
• Serialization overheads
• Lack of Multi-tenancy


Ampool: Unified Active Data Store Closes The Loop Driving Value At The Speed Of Business


• Ingest & Store hot data & update in-situ
• Serve data concurrently to multiple stages & tenants
• Automatically tier data to warm & cold (archive) stores, with usage & time
• Link insights back to Applications driving decisions in a closed loop
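The tiering idea can be sketched generically. The example below is not Ampool's API or implementation; it is a minimal illustration that assumes a fixed-capacity in-memory tier with least-recently-used eviction to a second tier, and it models only the usage-based part (not time-based aging). `TieredStore` and its methods are hypothetical names.

```python
from collections import OrderedDict

class TieredStore:
    """Toy two-tier store: a small hot tier backed by a larger cold tier."""
    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()   # in-memory tier, ordered from least to most recently used
        self.cold = {}             # stand-in for a warm/cold (archive) tier

    def put(self, key, value):
        # New or updated data always lands in the hot tier.
        self.cold.pop(key, None)
        self.hot[key] = value
        self.hot.move_to_end(key)
        self._evict()

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)   # record the access
            return self.hot[key]
        value = self.cold.pop(key)      # raises KeyError if the key is unknown
        self.put(key, value)            # promote on access
        return value

    def _evict(self):
        # Demote least-recently-used entries once the hot tier is over capacity.
        while len(self.hot) > self.hot_capacity:
            old_key, old_value = self.hot.popitem(last=False)
            self.cold[old_key] = old_value

store = TieredStore(hot_capacity=2)
for k in ["a", "b", "c"]:
    store.put(k, k.upper())
hot_after_puts = list(store.hot)
value_a = store.get("a")
hot_after_get = list(store.hot)
```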


Technology Trends…

• Huge (and rapidly growing) gap between memory and I/O bandwidths

• Rapidly increasing Network Bandwidth (~1000x in ~15 years)

• Plummeting costs of Solid State Storage (Comparable to HDD by 2019)

• NVDIMMs supported by major OSs, 10x density, 1/5th $/GB compared to DRAM

• Emergence of Storage Class Memory
  • 3D XPoint, PCM, MRAM, Memristor etc.
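As a quick worked check of the network bandwidth figure above, a ~1000x increase over ~15 years can be converted into a compound annual growth factor and a doubling time. The helper name is made up for this sketch, and the inputs are the approximate values quoted on the slide.

```python
import math

def annual_growth_factor(total_factor, years):
    """Compound annual growth factor implied by a total increase over a period."""
    return total_factor ** (1.0 / years)

factor = annual_growth_factor(1000, 15)               # roughly 1.58, i.e. ~58% per year
doubling_years = math.log(2) / math.log(factor)       # roughly 1.5 years

print(f"annual factor: {factor:.3f}, doubling every {doubling_years:.2f} years")
```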


…leading to Mainstream In-Memory Computing

• Scale-Out On-Demand Compute Infrastructure
  • Public & Private Clouds
• Fine-Grained Virtualization & Microservices
  • Containerization & Orchestration

Memory vs Disk Throughput Growth

Memory Hierarchy Getting Deeper

Emergence of Storage Class Memory

Why Now? Storage Technologies Price/Performance

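| Storage Type | Cost $/GiB (Min) | Cost $/GiB (Max) | Latency ns (Min) | Latency ns (Max) | IOPS (4 KiB random I/O per second) | Bandwidth (MB/s) | IOPS per unit cost of storage (Min) | IOPS per unit cost of storage (Max) | GB/s bandwidth per unit cost of storage (Min) | GB/s bandwidth per unit cost of storage (Max) |
|---|---|---|---|---|---|---|---|---|---|---|
| DRAM (DDR4) | 6 | 10 | 30 | 50 | 15,000,000 | 60,000 | 1,500,000 | 2,500,000 | 6.0 | 10.0 |
| SCM (3D XPoint, Projected) | 3 | 5 | 100 | 500 | 10,000,000 | 10,000 | 2,000,000 | 3,333,333 | 2.0 | 3.3 |
| NAND PCIe3 SSD (MLC) | 2 | 6 | 50,000 | 1,000,000 | 100,000 | 3,000 | 16,667 | 50,000 | 0.5 | 1.5 |
| HDD (7200 RPM) | 0.03 | 0.2 | 5,000,000 | 10,000,000 | 100 | 100 | 500 | 3,333 | 0.5 | 3.3 |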
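The last two column pairs of the table appear to be derived from the others: each divides the IOPS or bandwidth figure by the cost per GiB, with the maximum cost giving the minimum ratio. The sketch below recomputes them under that assumption; the dictionary layout and function name are illustrative only.

```python
# Per storage type: (cost_min $/GiB, cost_max $/GiB, IOPS, bandwidth in MB/s),
# copied from the table.
storage = {
    "DRAM (DDR4)": (6, 10, 15_000_000, 60_000),
    "SCM (3D XPoint, Projected)": (3, 5, 10_000_000, 10_000),
    "NAND PCIe3 SSD (MLC)": (2, 6, 100_000, 3_000),
    "HDD (7200 RPM)": (0.03, 0.2, 100, 100),
}

def per_unit_cost(cost_min, cost_max, iops, mb_per_s):
    """Divide performance by cost; the highest cost yields the lowest ratio."""
    gb_per_s = mb_per_s / 1000
    return {
        "iops_min": iops / cost_max,
        "iops_max": iops / cost_min,
        "gbps_min": gb_per_s / cost_max,
        "gbps_max": gb_per_s / cost_min,
    }

derived = {name: per_unit_cost(*row) for name, row in storage.items()}

for name, d in derived.items():
    print(f"{name}: {d['iops_min']:,.0f} to {d['iops_max']:,.0f} IOPS per $/GiB, "
          f"{d['gbps_min']:.1f} to {d['gbps_max']:.1f} GB/s per $/GiB")
```

On these figures, HDD and SSD bandwidth per unit cost overlap, while DRAM and projected SCM lead by a wide margin on IOPS per unit cost.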


Ampool: Distributed Memory-centric Active Data Store Powered By Apache Geode

[Architecture diagram: one Locator and four Servers, accessed through REST and Java APIs. Labeled capabilities: In-Memory Distributed Sys., Low-latency Comms., Key-Value Store, Function Pushdown, High Throughput, Table Store, Pluggable Store Manager, Smart Data Tiering, Mature Event Model, Tunable Consistency, Metadata/Catalog, Security AuthZ]


• ~2x faster than HBase on inserts
• ~4x faster than HBase on scans
• ~6x faster than Alluxio (Tachyon) on scans
• 2-3x faster than HBase on lookups

Multi-Modal Analytics

• Real-time performance

• Operational Analytics

• Batch/ Machine Learning

• Business Intelligence

• Predictive Maintenance

Target Verticals & Use-Cases

Financial Svcs.
• Fraud Detection
• Credit/Market risks
• Event-based marketing

Telecom
• Network/quality opt.
• Mobile user analysis
• Event-based marketing

Retail
• Targeted digital offers
• Markdown optimization
• Event-based targeting

Media
• Content/ad delivery
• Event/behavior-based targeting

Anomaly Detection
• Event/activity monitoring
• Real-time automated decisions

IoT Analytics
• Device management
• Comms. optimization

360 Customer Analytics
• Social media sentiment analysis
• Event-based ad targeting


Demo

Illustrative Use Case in Ad-Tech

Ad Analytics Pipeline with Kafka-Datatorrent-Ampool

Streaming Ad Analytics
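The demo itself is not reproduced in this transcript. As a rough idea of the kind of computation a streaming ad analytics pipeline performs, the sketch below computes click-through rate per ad over fixed (tumbling) time windows in plain Python. It does not use Kafka, DataTorrent, or Ampool APIs; the event format, function name, and sample data are all assumptions.

```python
from collections import defaultdict

def windowed_ctr(events, window_seconds=60):
    """Click-through rate per (window start, ad) over tumbling windows.

    events: iterable of (timestamp_seconds, ad_id, kind),
    where kind is "impression" or "click" (a hypothetical format).
    """
    counts = defaultdict(lambda: {"impression": 0, "click": 0})
    for ts, ad_id, kind in events:
        window_start = ts - ts % window_seconds   # align to the window boundary
        counts[(window_start, ad_id)][kind] += 1
    return {
        key: (c["click"] / c["impression"] if c["impression"] else 0.0)
        for key, c in counts.items()
    }

events = [
    (0, "ad1", "impression"),
    (10, "ad1", "impression"),
    (20, "ad1", "click"),
    (70, "ad1", "impression"),
]
ctr = windowed_ctr(events)
```

In a real pipeline the events would arrive continuously from a message bus and the per-window results would be written to a store that applications can query while the window is still open.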

/company/ampool-inc- | /AmpoolIO | @AmpoolIO | www.ampool.io
