© 2009 VMware Inc. All rights reserved Building Big Data Applications Services for Private Clouds Richard McDougall Chief Architect, Storage and Application Services VMware, Inc @richardmcdougll

Building Big Data Applications


A talk on the frameworks for building big-data applications.


Page 1: Building Big Data Applications

© 2009 VMware Inc. All rights reserved

Building Big Data Applications Services for Private Clouds

Richard McDougall

Chief Architect, Storage and Application Services

VMware, Inc

@richardmcdougll

Page 2: Building Big Data Applications


Infrastructure, Apps and now Data…

Private Public

Build Run

Manage

Simplify Infrastructure With Cloud

Simplify App Platform Through PaaS Simplify Data

Page 3: Building Big Data Applications


Trend 1/3: New Data Growing at 60% Y/Y

Source: The Information Explosion, 2009

medical imaging, sensors

cad/cam, appliances, machine data, digital movies

digital photos

digital tv

audio

camera phones, rfid

satellite images, logs, scanners, twitter

Exabytes of information stored: 20 Zetta by 2015, 1 Yotta by 2030

Yes, you are part of the yotta generation…

Page 4: Building Big Data Applications


Data Growth in the Enterprise

Page 5: Building Big Data Applications


Trend 2/3: Big Data – Driven by Real-World Benefit

Page 6: Building Big Data Applications


Enterprise : Early Adopter Industries and Use Cases

Page 7: Building Big Data Applications

Early Adopters: Enterprise Segmentation

Verticals
•  Financial Services
•  Retail
•  Telco
•  Manufacturing
•  Government

Targets
•  Existing Hadoop Users
•  Business Analysts
•  Data Scientists
•  LOB managers
•  IT/Ops

Use Cases
•  Business Trend Analytics
•  Revenue analytics
•  CDR, call pattern analytics
•  Sensor data analytics
•  Log, machine data analytics
•  Fraud detection
•  Homeland security
•  Predictive analytics

Page 8: Building Big Data Applications

Early Adopters: Non-enterprise Segmentation

Verticals
•  Online Advertising
•  eCommerce
•  Mobile
•  Social Media
•  Gaming

Targets
•  End users/Exec users
•  Business Analysts
•  PM, LOB managers
•  Marketing/Sales
•  Data Engineers
•  Data Scientists
•  IT/Operations

Use Cases
•  Behavioral Analytics
•  Audience segmentation
•  Revenue Optimization
•  User activity monetization
•  Inventory, price management
•  Recommendations
•  Predictive analytics

Page 9: Building Big Data Applications


Why now? more transactions (Social/Mobile/Local)

3.7B calls/month

30B messages/month

10k card transactions/sec

1TB data/day

500 TB data/day

SoMoLo

35 check-ins/sec

13k API calls/sec

Big “traditional” companies

Size of data communications transactions

Page 10: Building Big Data Applications

Trend 3/3: Value from Data Exceeds Hardware Cost

!  Value from the intelligence of data analytics now outstrips the cost of hardware
•  Hadoop enables the use of 10x lower cost hardware
•  Hardware cost halving every 18 months

Big Iron: $40k/CPU

Commodity Cluster: $1k/CPU

Value

Cost
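The hardware-cost trend on this slide can be worked through numerically. The sketch below assumes a simple exponential model, cost(t) = cost_0 × 0.5^(t/18) with t in months, and reuses the two per-CPU prices shown on the slide. The function and constant names are illustrative, not from the talk.

```python
def cost_after(initial_cost: float, months: float, halving_period: float = 18.0) -> float:
    """Cost after `months`, assuming it halves every `halving_period` months."""
    return initial_cost * 0.5 ** (months / halving_period)


# Per-CPU prices quoted on the slide.
BIG_IRON_PER_CPU = 40_000.0
COMMODITY_PER_CPU = 1_000.0


def price_ratio() -> float:
    """How many commodity CPUs one big-iron CPU budget buys at the quoted prices."""
    return BIG_IRON_PER_CPU / COMMODITY_PER_CPU


if __name__ == "__main__":
    for months in (0, 18, 36, 54):
        print(months, cost_after(COMMODITY_PER_CPU, months))
    print("big iron / commodity:", price_ratio())
```

Under this assumed model, any fixed analytic value is overtaken by falling hardware cost sooner or later; the halving period is the only parameter that changes when.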

Page 11: Building Big Data Applications


The Old Big Data Stack

Column Oriented Relational Database (Oracle, Teradata, DB2)

Data Visualization (Crystal, Business Objects)

ETL: Extract, Transform, Load (Informatica)

Business Intelligence

Statistics (SAS, SPSS)

Master Data Management (Oracle, SAP)

Files

SQL Databases

Page 12: Building Big Data Applications


The Old Big Data Stack

Column Oriented Relational Database (Oracle, Teradata, DB2)

Data Visualization (Crystal, Business Objects)

ETL: Extract, Transform, Load (Informatica)

Business Intelligence

Statistics (SAS, SPSS)

Master Data Management (Oracle, SAP)

Files

SQL Databases

!  Unable to handle large data volumes & diversity of data

!  Iterative, brute-force and slow process

!  Lack of ad-hoc data navigation across events and time

!  Cumbersome ETL to “process” and DBAs to “prepare”

!  Focused on structured data that is warehoused

! Web analytics solutions force real-time events into rigid schemas in DBs

Page 13: Building Big Data Applications

The Journey To Big Data Analytics

1. BI As A Service (Technology Focus)
•  All Data
•  Faster Answers
•  Elastic & Scalable
•  Analytic Engines, Cloud Infrastructure

2. Agile Analytics (People & Productivity Focus)
•  Data Science
•  Collaboration
•  Self-Service
•  Analytic Productivity Platform, Agile Process & Tools

3. Predictive Enterprise (Application Focus)
•  Real Time Decisions
•  New Applications
•  Data Monetization
•  Big Data Enabled Apps

Goal: encourage experimentation with existing data

Goal: discover meaningful insights that impact the business

Goal: operationalize those insights as quickly as possible

Page 14: Building Big Data Applications

Customer profiles

1.  Business analysts, LOB managers, execs
•  Need: out-of-the-box analytics
•  Designed for: self-service for end-user leveraging app developers

2.  Data engineers/analysts
•  Need: out-of-the-box + some customization
•  Designed for: admin + operations

3.  Data scientists
•  Need: power capabilities + heavy customization
•  Designed for: data scientists

4.  IT, Operations
•  Need: out-of-the-box + some customization
•  Designed for: IT/admin, ops

Page 15: Building Big Data Applications

What is Data Science and Data Engineering?

Data Science & Data Engineering
•  Distributed, Parallelization Algorithm & Programming Skills
•  Math and Statistical Knowledge
•  Business Domain and Problem Understanding
•  Vertical or Horizontal Use Case and Analytics Experience

Page 16: Building Big Data Applications


What is Driving Big Data?

Structured

Largely Unstructured

Semi-structured

Source: IBM and Oxford Survey: Getting Closer to Customers Tops Big Data Agenda, October 17, 2012

Page 17: Building Big Data Applications


Today’s Big Data System:

ETL

Real Time Streams

Unstructured Data (HDFS)

Real Time Structured Database

Big SQL

Data Parallel Batch Processing

Real-Time Processing (S4, Storm)

Analytics

Page 18: Building Big Data Applications

The Unified Analytics Cloud Platform

Layers: Cloud Infrastructure (Private, Public), Data Platform, Developer Frameworks, Analytics Tools

Components:
•  vSphere
•  Database/DataStore: Cassandra, HawQ, hBase, Impala, HDFS
•  Data PaaS, PaaS
•  Hadoop, R, Python, Madlib
•  Cloudfoundry, Spring
•  Datameer, Karmasphere
•  Data-Director, EMC Chorus
•  Tableau

Page 19: Building Big Data Applications

The New Big Data System

ETL

Real Time Streams

Structured and Unstructured Data (HDFS, S3)

Real Time Structured Database

Structured Data Engine

Unstructured and Batch Processing (Hadoop, Hive)

Real-Time Stream Processing

Data Visualization (Excel, Tableau)

Federated Query (SQL aggregation)

Compute, Storage, Networking

Cloud Infrastructure

Common Query

Automated Models

Business Intelligence

Page 20: Building Big Data Applications

An Example – Automated Performance Management

10M Performance Stats/min

Stats Database

Batch Baseline Calculation

Trigger Models

Compute, Storage, Networking

Cloud Infrastructure
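The pipeline on this slide (stats stream, stats database, batch baseline calculation, trigger models) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system described in the talk: the `(metric, value)` record shape, the mean/standard-deviation baseline, and the 3-sigma trigger threshold are all hypothetical choices.

```python
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple

Stat = Tuple[str, float]            # hypothetical record: (metric name, observed value)
Baseline = Tuple[float, float]      # (mean, population standard deviation)


def compute_baselines(history: Iterable[Stat]) -> Dict[str, Baseline]:
    """Batch step: derive a per-metric baseline from stored stats."""
    by_metric: Dict[str, List[float]] = {}
    for name, value in history:
        by_metric.setdefault(name, []).append(value)
    return {name: (mean(vals), pstdev(vals)) for name, vals in by_metric.items()}


def should_trigger(stat: Stat, baselines: Dict[str, Baseline], threshold: float = 3.0) -> bool:
    """Trigger step: flag a stat that deviates from its baseline by more than
    `threshold` standard deviations. Metrics without a baseline never trigger."""
    name, value = stat
    if name not in baselines:
        return False
    mu, sigma = baselines[name]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


if __name__ == "__main__":
    history = [("cpu", 10.0), ("cpu", 12.0), ("cpu", 11.0), ("cpu", 9.0), ("cpu", 13.0)]
    baselines = compute_baselines(history)
    print(should_trigger(("cpu", 11.5), baselines))
    print(should_trigger(("cpu", 50.0), baselines))
```

The split mirrors the slide: the expensive baseline calculation runs in batch over the stats database, while the cheap comparison can run on each incoming stat.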

Page 21: Building Big Data Applications

Big (Data) problems: becoming the standardized stack

Columns: Google, Facebook, Yahoo, LinkedIn, Cloudera, Twitter. Values appear in column order; blank cells are omitted.

•  Metadata: Dremel, Hive, Hive, Hive
•  Schedule & pipeline workloads: Evenflow, Databee, Oozie, Azkaban, Oozie
•  Dataflow/queries: -/Sawzall, /Hive, Pig/Hive, Pig, Pig/Hive, Cascading
•  More-structured data store: Bigtable, Hbase, Hbase, Voldemort, Hbase, Cassandra
•  DB data collection/integration: MySQL gateway, Sqoop, Sqoop
•  Event data collection: Scribe, Data Highway, Kafka?, Flume, Scribe
•  Streaming data processing: -, -, -, -, -, -
•  Batch data processing: Map/Reduce, Hadoop, Hadoop, Hadoop, Hadoop, Hadoop
•  File Storage: GFS, Hadoop, Hadoop, Hadoop, Hadoop, Hadoop
•  Coordination: Chubby, Zookeeper, Zookeeper, Zookeeper, Zookeeper, Zookeeper

Page 22: Building Big Data Applications


New Technologies

ETL

Real Time Streams

HDFS, Ceph, MAPR, Colossus

Real Time Structured Database

Aster, Greenplum, etc.

Unstructured and Batch Processing (Hadoop, Hive)

Real-Time Stream Processing

Data Visualization (Excel, Tableau)

Query Virtualization (SQL aggregation)

Compute, Storage, Networking

Cloud Infrastructure

Common Query

Automated Models

Business Intelligence

Twitter Sensor Data

Mobile Events Machine Logs

S4, Storm

SPARK SHARK Gemfire hBase?

Map-Reduce

Machine Learning CETAS

Page 23: Building Big Data Applications

Agenda

!  Frameworks
•  Batch processing: Hadoop, Spark
•  Graph processing: Pregel, Apache Giraph
•  Real-time processing: Storm, S4, D-Streams
•  Interactive processing: Hive, Impala, Shark

!  New requirements
•  Better network architectures, abstractions and end-to-end resource management
•  Whither disk-locality and the flexibility to move data to compute instead
•  Cluster/Datacenter-wide storage abstractions and services
•  The silo-less datacenter (multiple frameworks sharing a single physical cluster and sharing “sticky” data)

Page 24: Building Big Data Applications

Big Data Processing Patterns (batch, real-time or interactive)

Reverse Funnel (small input, large output, e.g., logfile loading)

Funnel (large input, small output, e.g., link/ad click-statistics)

Data transform (input and output sizes similar, e.g., data conversion/translation)

Iterative, e.g., machine learning tasks

Graph-based analyses to reason about relationships, e.g., PageRank, Ravi’s social approach to VI management

Hadoop, Hive, Impala

Storm, S4, D-Streams, Shark

Spark

Pregel, Giraph

Page 25: Building Big Data Applications


Batch processing frameworks (1/2)

! Apache Hadoop MapReduce (Yahoo!)

•  Parallel data-processing paradigm (made popular by Google). Uses a distributed file system (HDFS) for persistence. Uses commodity h/w

•  Model of operation: Mapper (read from HDFS + compute in parallel) -> Reducer (process map outputs in parallel) -> write to HDFS

•  Key components: Namenode, Datanode, TaskTracker, JobTracker

•  Apache Zookeeper sometimes used for coordination

•  Weakness: Not well-suited for iterative (or graph) computations
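The Mapper -> Reducer model of operation above can be illustrated with a single-process sketch. It is not the Hadoop API: the `map_reduce` helper and the word-count mapper and reducer are hypothetical stand-ins, a Python list replaces HDFS input, and the grouping dictionary plays the role of the shuffle between the two phases.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run mapper over every record, group emitted values by key, then reduce each group."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for record in records:                    # map phase (parallel across nodes in Hadoop)
        for key, value in mapper(record):
            grouped[key].append(value)        # shuffle: collect values per key
    # reduce phase (parallel across keys in Hadoop)
    return {key: reducer(key, values) for key, values in grouped.items()}


def word_mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key: str, values: List[int]) -> int:
    """Sum all counts emitted for one word."""
    return sum(values)


if __name__ == "__main__":
    print(map_reduce(["big data", "Big clusters"], word_mapper, sum_reducer))
```

An iterative algorithm would have to call `map_reduce` once per iteration and, in real Hadoop, write to and re-read from HDFS each time, which is the weakness the slide points out.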

Page 26: Building Big Data Applications


Batch processing frameworks (2/2)

!  Spark (UC Berkeley)

•  Support for iterative computations and interactive data-mining by caching data in cluster RAM. Uses commodity machines

•  Core abstraction: Resilient Distributed Datasets (RDDs) used as variables in Spark programs. RDDs include lineage data for easy recovery/reconstruction

•  Up to ~20X speedup over Hadoop. Used by Quantifind, Conviva, …

Image courtesy Zaharia et al.: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
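The lineage idea behind RDDs can be shown with a toy, single-machine class. This is a sketch under loud assumptions: `ToyRDD` and its methods are hypothetical and are not the Spark API, and the `_cache` attribute merely stands in for data held in cluster RAM. It shows only that a dataset which remembers how it was derived can be rebuilt after its cached copy is lost.

```python
from typing import Any, Callable, List, Optional


class ToyRDD:
    """A toy dataset that records its lineage (parent + transformation)."""

    def __init__(
        self,
        source: Optional[List[Any]] = None,
        parent: Optional["ToyRDD"] = None,
        transform: Optional[Callable[[List[Any]], List[Any]]] = None,
        label: str = "source",
    ) -> None:
        self._source = source          # only set on the root dataset
        self._parent = parent
        self._transform = transform
        self.label = label
        self._cache: Optional[List[Any]] = None

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        return ToyRDD(parent=self, transform=lambda rows: [fn(r) for r in rows], label="map")

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)], label="filter")

    def lineage(self) -> List[str]:
        """Chain of steps from the root dataset to this one."""
        chain = [] if self._parent is None else self._parent.lineage()
        return chain + [self.label]

    def cache(self) -> "ToyRDD":
        """Materialize and keep the result in memory."""
        self._cache = self.collect()
        return self

    def drop_cache(self) -> None:
        """Simulate losing the in-memory copy, e.g., through a node failure."""
        self._cache = None

    def collect(self) -> List[Any]:
        if self._cache is not None:
            return list(self._cache)
        if self._parent is None:
            return list(self._source or [])
        # Recompute from lineage instead of relying on a replica.
        return self._transform(self._parent.collect())


if __name__ == "__main__":
    evens = ToyRDD(source=[1, 2, 3, 4]).map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()
    print(evens.lineage(), evens.collect())
    evens.drop_cache()
    print(evens.collect())
```

Iterative jobs benefit because repeated passes read the cached result, while recovery needs only the lineage and the original input.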

Page 27: Building Big Data Applications

Graph processing frameworks

!  Pregel (Google)/Apache Giraph
•  Multiple instances of vertex-programs: user-defined functions running at/on each vertex
•  Bulk Synchronous Parallel (BSP) processing, e.g., used for PageRank
•  Stateful in-memory computations. Fault-tolerance via checkpoints
•  Runs on commodity hardware (racks with high intra-rack bandwidth)

Diagram: VM1 and VM2 alternate between Compute, Communicate and Barrier phases
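The compute, communicate, barrier cycle can be sketched with PageRank, the example the slide names. This is a single-process illustration, not the Pregel or Giraph API: `pagerank_bsp` is a hypothetical name, the graph is a plain adjacency dictionary, and the sketch assumes every vertex has at least one out-edge and every edge points at a vertex in the graph.

```python
from typing import Dict, List


def pagerank_bsp(
    graph: Dict[str, List[str]],
    supersteps: int = 20,
    damping: float = 0.85,
) -> Dict[str, float]:
    """Vertex-centric PageRank run as a fixed number of BSP supersteps."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        # Compute + communicate: each vertex sends rank / out_degree along its out-edges.
        inbox: Dict[str, List[float]] = {v: [] for v in graph}
        for vertex, out_edges in graph.items():
            share = rank[vertex] / len(out_edges)
            for dst in out_edges:
                inbox[dst].append(share)
        # Barrier: only after all messages are delivered does any vertex update its state.
        rank = {v: (1.0 - damping) / n + damping * sum(inbox[v]) for v in graph}
    return rank


if __name__ == "__main__":
    print(pagerank_bsp({"a": ["b"], "b": ["c"], "c": ["a"]}))
```

Building the new `rank` dictionary only after the message loop finishes is what plays the role of the barrier: no vertex sees a neighbor's updated value within the same superstep.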

Page 28: Building Big Data Applications

Real-time processing frameworks (stream-processing) 1/2

!  S4 (Yahoo!), Storm (Twitter)
•  Record-at-a-time processing. Checkpointing for fault-tolerance (S4)

Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf

Page 29: Building Big Data Applications

Real-time processing frameworks (stream-processing) 2/2

!  Discretized Streams/D-Streams (UC Berkeley)
•  Treat a streaming computation as a series of batch computations on small time intervals. D-Stream = chain of RDDs
•  Fault-tolerance without replication or upstream backup (buffering)

Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf
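The discretization step can be sketched as follows. This is a minimal illustration, not the D-Streams implementation: the `(timestamp, key)` event shape and both function names are assumptions, and each interval's batch is a plain list where the real system would use an RDD with lineage-based recovery.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]   # hypothetical record: (timestamp in seconds, key)


def discretize(events: Iterable[Event], interval: float) -> Dict[int, List[str]]:
    """Assign each event to the fixed-size time interval it falls in."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, key in events:
        batches[int(timestamp // interval)].append(key)
    return dict(batches)


def count_per_interval(events: Iterable[Event], interval: float = 1.0) -> Dict[int, Counter]:
    """Run the same small batch job (a key count) on every interval, in time order."""
    batches = discretize(events, interval)
    return {index: Counter(keys) for index, keys in sorted(batches.items())}


if __name__ == "__main__":
    stream = [(0.1, "a"), (0.9, "b"), (1.2, "a"), (1.5, "a"), (2.0, "b")]
    for index, counts in count_per_interval(stream).items():
        print(index, dict(counts))
```

Compared with record-at-a-time processing, the unit of work here is a whole interval, so a lost result can be recomputed by rerunning one small batch.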

Page 30: Building Big Data Applications

Interactive processing frameworks 1/4

!  Apache Hive (Facebook)
•  Open-source data warehouse built on top of Hadoop. HiveQL queries compiled into MapReduce jobs. Expensive WHERE clauses = table scans = high latency

Image courtesy Cubrid: http://www.cubrid.org/blog/dev-platform/platforms-for-big-data/

Page 31: Building Big Data Applications

Interactive processing frameworks 2/4

!  Pivotal HawQ

Page 32: Building Big Data Applications

Interactive processing frameworks 3/4

!  Impala (Cloudera)
•  Inspired by Dremel (Google). Key concepts: columnar-data storage (Trevni), aggregation trees for distributed query evaluation
•  Takes advantage of Hive tables. Uses memory as a cache for tables
•  Does not use MapReduce to answer queries (unlike Hive)
•  3X - 90X faster than Hive

Image courtesy Cloudera: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
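The two key concepts named on this slide, columnar storage and aggregation trees, can be sketched together. This is an illustration under stated assumptions and not Impala's implementation: each partition is a hypothetical dictionary of column name to value list, leaves compute partial `(sum, count)` pairs from a single column, and partials are merged level by level with a fan-in of at least 2.

```python
from typing import Dict, List, Sequence, Tuple

Partition = Dict[str, List[float]]   # column name -> values (columnar layout)
Partial = Tuple[float, int]          # (sum, count)


def leaf_aggregate(partition: Partition, column: str) -> Partial:
    """Leaf node: read only the requested column and return a partial aggregate."""
    values = partition[column]
    return (sum(values), len(values))


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two partial aggregates."""
    return (a[0] + b[0], a[1] + b[1])


def tree_aggregate(partials: Sequence[Partial], fan_in: int = 2) -> Partial:
    """Merge partials level by level, `fan_in` children per parent (fan_in >= 2)."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(partials)
    if not level:
        return (0.0, 0)
    while len(level) > 1:
        next_level: List[Partial] = []
        for i in range(0, len(level), fan_in):
            acc = level[i]
            for partial in level[i + 1 : i + fan_in]:
                acc = merge(acc, partial)
            next_level.append(acc)
        level = next_level
    return level[0]


def average(partitions: Sequence[Partition], column: str) -> float:
    """Average of one column across all partitions (assumes at least one value)."""
    total, count = tree_aggregate([leaf_aggregate(p, column) for p in partitions])
    return total / count


if __name__ == "__main__":
    parts = [
        {"latency": [1.0, 2.0], "bytes": [10.0, 20.0]},
        {"latency": [3.0], "bytes": [30.0]},
        {"latency": [4.0, 5.0], "bytes": [40.0, 50.0]},
    ]
    print(average(parts, "latency"))
```

The `bytes` column is never touched when averaging `latency`, which is the point of a columnar layout; the tree keeps any single node from having to merge every partial result.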

Page 33: Building Big Data Applications

Interactive processing frameworks 4/4

!  Shark (UC Berkeley)
•  Key concepts: columnar-data storage (in-memory), Directed Acyclic Graphs of Tasks for distributed query optimization and evaluation, dynamic mid-query replanning
•  Uses Spark RDDs to store data and query processing results
•  SQL-interface (HiveQL compatible)
•  100X faster than Hadoop, 100X faster than Hive

Image courtesy Xin et al.: http://shark.cs.berkeley.edu/presentations/2012-11-26-shark-tech-report.pdf

Page 34: Building Big Data Applications

Unifying the Big Data Platform using Virtualization

!  Goals
•  Make it fast and easy to provision new data clusters on demand
•  Allow mixing of workloads
•  Leverage virtual machines to provide isolation (esp. for multi-tenant)
•  Optimize data performance based on virtual topologies
•  Make the system reliable based on virtual topologies

!  Leveraging Virtualization
•  Elastic scale
•  Use high-availability to protect key services, e.g., Hadoop’s namenode/job tracker
•  Resource controls and sharing: re-use underutilized memory, CPU
•  Prioritize workloads: limit or guarantee resource usage in a mixed environment

Cloud Infrastructure

Private Public
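The "limit or guarantee resource usage" control can be sketched as a small allocator. This is a simplified model and not vSphere's scheduler: the `Workload` fields (guarantee, limit, shares) and the `allocate` function are hypothetical names, and the policy assumed here is to grant every guarantee first and then split spare capacity in proportion to shares without exceeding any limit.

```python
from typing import Dict, NamedTuple


class Workload(NamedTuple):
    guarantee: float   # capacity always granted
    limit: float       # capacity never exceeded
    shares: float      # relative weight for spare capacity (must be > 0)


def allocate(capacity: float, workloads: Dict[str, Workload]) -> Dict[str, float]:
    """Grant guarantees, then hand out the remainder by shares, respecting limits."""
    alloc = {name: w.guarantee for name, w in workloads.items()}
    remaining = capacity - sum(alloc.values())
    if remaining < 0:
        raise ValueError("guarantees exceed capacity")
    eps = 1e-9
    active = {name for name, w in workloads.items() if w.limit - w.guarantee > eps}
    while remaining > eps and active:
        total_shares = sum(workloads[name].shares for name in active)
        granted = 0.0
        for name in sorted(active):
            w = workloads[name]
            offer = remaining * w.shares / total_shares
            grant = min(offer, w.limit - alloc[name])
            alloc[name] += grant
            granted += grant
            if w.limit - alloc[name] <= eps:
                active.discard(name)      # capped: its unused offer is redistributed next round
        remaining -= granted
    return alloc


if __name__ == "__main__":
    print(allocate(100.0, {
        "hadoop": Workload(guarantee=20.0, limit=100.0, shares=1.0),
        "sql": Workload(guarantee=30.0, limit=40.0, shares=1.0),
    }))
```

In the usage example the SQL workload is capped at its limit, and the capacity it cannot use flows to the Hadoop workload, which is the re-use of underutilized resources the slide describes.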

Page 35: Building Big Data Applications

SQL Cluster

Unified Analytics Infrastructure

Hadoop Cluster

Private Public

Big SQL

A Unified Analytics Cloud Significantly Simplifies

Hadoop NoSQL

Decision Support Cluster

NoSQL Cluster

!  Simplify
•  Single Hardware Infrastructure
•  Faster/Easier provisioning

!  Optimize
•  Shared Resources = higher utilization
•  Elastic resources = faster on-demand access

Page 36: Building Big Data Applications

Simplify Heterogeneous Data Management via Data PaaS

Cloud Infrastructure

Data Platform

Developer

Analytics Tools

Databases

File-system

Big SQL

Large-Scale NoSQL

In-Memory

Data PaaS – Common Data Management Layer

Provisioning

Management

Multi-tenancy

Data Discovery

Import/Export

Cloud Infrastructure

Page 37: Building Big Data Applications

Technology: Databases and Data Stores for Big Data

Data store categories, from unstructured to structured:

File-system
•  Types of data: log files, machine-generated data, documents, device data, etc.
•  Technologies: NAS, HDFS, Blob, S3, MAPR, etc.
•  Values: store any data, easy to scale out, can optimize for cost

Large-Scale NoSQL
•  Types of data: loosely typed device data, records, events, statistics, complex relations/graphs
•  Technologies: Cassandra, hBase, Voldemort
•  Values: easy to scale out, flexible and dynamic schemas

In-Memory
•  Types of data: structured, partitionable data
•  Technologies: Gemfire, Redis, Membase, SPARK
•  Values: high throughput, low latency

Big SQL
•  Types of data: structured data
•  Technologies: HawQ, Impala, Aster, …
•  Values: high performance for repetitive queries; ease of query language

Page 38: Building Big Data Applications

The Unified Analytics Cloud Platform

Layers: Cloud Infrastructure (Private, Public), Data Platform, Developer Frameworks, Analytics Tools

Components:
•  vSphere
•  Database/DataStore: Cassandra, Greenplum, hBase, Voldemort, HDFS
•  Data PaaS, PaaS
•  Hadoop, Python, Madlib, R
•  Cloudfoundry, Spring
•  Datameer, Karmasphere
•  Data-Director, EMC Chorus
•  Tableau

Page 39: Building Big Data Applications

Summary

!  Revolution in Big Data is under way
•  Data centric applications are now critical

!  Hadoop on Virtualization
•  Proven performance
•  Cloud/Virtualization values apparent for Hadoop use

!  Simplify through a Unified Analytics Cloud
•  One Platform for today’s and future big-data systems
•  Better Utilization
•  Faster deployment, elastic resources
•  Secure, Isolated, Multi-tenant capability for Analytics

Page 40: Building Big Data Applications

References

!  Twitter
•  @richardmcdougll

!  My CTO Blog
•  http://communities.vmware.com/community/vmtn/cto/cloud

!  Hadoop on vSphere
•  Talk @ Hadoop World
•  Performance Paper – http://www.vmware.com/files/.../VMW-Hadoop-Performance-vSphere5.pdf

!  Spring Hadoop
•  http://blog.springsource.org/2012/02/29/introducing-spring-hadoop