Online Analytical Processing of Large Distributed Databases Luc Boudreau Lead Engineer, Pentaho Corporation

Olap scalability


Page 1: Olap scalability

Online Analytical Processing of Large Distributed Databases

Luc Boudreau, Lead Engineer, Pentaho Corporation

Page 2: Olap scalability
Page 3: Olap scalability

"it's all about data movement and operating on that data on the fly"

Page 4: Olap scalability
Page 5: Olap scalability

a relational database

Page 6: Olap scalability

Relational Databases

● Static schema
● Minimized redundancy
● Referential integrity
● Transactional

Page 7: Olap scalability

Classic RDBMS internals

● "Shared Everything" paradigm
● Private planner
● Multiple private processors
● Multiple private data stores

[Diagram: one PLANNER / SCHEDULER and three PROCESSOR units]

Page 8: Olap scalability

What RDBMS are for

● Operational data
● Normalized models
● Statically typed data

Page 9: Olap scalability

What RDBMS are NOT for

● "Full Scan" aggregated computations
● Multi-dimensional queries (think pivot)
● Unstructured data

Page 10: Olap scalability

OK, so how's that different from Big Data platforms?

Page 11: Olap scalability

Big Data - More than a buzzword
(although sometimes it's hard to tell...)

Big Data is not a product. It is an architecture.

Page 12: Olap scalability

Big Data - More than a buzzword
(although sometimes it's hard to tell...)

A schema-less distributed storage and processing model for data.

Page 13: Olap scalability

Big Data

● Schema-less
    ○ Programmatic queries
    ○ "Map" of MapReduce
● High redundancy
    ○ Distributed processing
    ○ "Reduce" of MapReduce
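As a rough illustration of the two halves named above, here is a minimal in-process sketch of the MapReduce idea in Python. The log format and the names `LOG_LINES`, `map_line` and `reduce_counts` are hypothetical; a real cluster would run the map step on the nodes that hold the data and shuffle the pairs before reducing.

```python
from collections import defaultdict

# Hypothetical raw, schema-less log lines: the query decides how to read them.
LOG_LINES = [
    "2012-06-01 USA click",
    "2012-06-01 CANADA view",
    "2012-06-02 USA view",
    "2012-06-02 USA click",
]

def map_line(line):
    """Map step: impose a schema on one raw record and emit (key, value)."""
    _date, country, _event = line.split()
    return (country, 1)

def reduce_counts(pairs):
    """Reduce step: combine all values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

events_per_country = reduce_counts(map_line(line) for line in LOG_LINES)
print(events_per_country)
```

Changing what is counted means changing `map_line`, not the stored data, which is the sense in which the queries are programmatic.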

Page 14: Olap scalability

Big Data

● No referential integrity

● Non transactional

● High latency

Page 15: Olap scalability

Classic Big Data internals

● "Shared nothing" paradigm
● Push the processing closer to the data
● The query defines the schema

[Diagram: one SCHEDULER and three PROCESSOR units]

Page 16: Olap scalability

What Big Data is for

● Unstructured data
    keep everything
● Distributed file system
    great for archiving
● Data is fixed
    only the process evolves

Page 17: Olap scalability

What Big Data is for

● Ludicrous amounts of data
    keep everything, remember?
● Made on the cheap
    each processing unit is commodity hardware

Page 18: Olap scalability

What Big Data is NOT for

● Low latency applications
    arbitrary exploration of the data is close to impossible
● End-users
    writing code is easy. writing good code is hard.
● Replacing your operational DB

Page 19: Olap scalability

Some more limitations

● No structured query language
    exploration is tedious
● Accuracy & Exactitude
    the burden is put on the end user / query designer
● No query optimizer
    cannot optimize at runtime. does exactly what you tell it to.

Page 20: Olap scalability

why is this so similar to NoSQL?

Page 21: Olap scalability

First, defining NoSQL...

● NoSQL: the thing named after what it lacks, with as many definitions as there are products
    (it usually turns out to be some sort of key-value store)

Page 22: Olap scalability

Why "NoSQL"? Why all the hate?!

● Historical reasons
    ○ Wrong technological choices
    ○ Blind faith in RDBMS scalability
    ○ General wishful thinking and voodoo magic

Page 23: Olap scalability

Why "NoSQL"? Why all the hate?!

● "SQL" itself was never the issue
● NoSQL projects are implementing SQL-like query languages

Page 24: Olap scalability

bringing structured queries to Big Data

Page 25: Olap scalability

Current efforts

● Straight SQL implementations
    Greenplum: Straight SQL on top of Big Data
    Hive JDBC: A hybrid of DSL & SQL
● The Splunk approach
    SQL with missing columns
● Runtime query optimizers
    Optiq framework: SQL with Big Data federated sources

Page 26: Olap scalability

isn't there something better than SQL for analytics?

Page 27: Olap scalability

Online Analytical Processing (OLAP)

Page 28: Olap scalability

Widely used. Little known.

● Your favorite corporate dashboards
● Google Analytics & other ad-hoc tools

Page 29: Olap scalability

Analytics centric language

● Multidimensional Expressions (MDX)
    a powerful query language for analytics
● Forget about rows and columns
    as many axes as you need
● Slice & dice
    start from everything - progressively focus only on relevant data

Page 30: Olap scalability

Business domain driven

● Hierarchical view of a multidimensional universe

Page 31: Olap scalability

An example

What are my total sales for the current year, per month, for male customers?

with
  member [Measures].[Accumulated Sales]
    as 'Sum(YTD(), [Measures].[Store Sales])'
select
  {[Measures].[Accumulated Sales]} on columns,
  {Descendants([Time].[1997], [Time].[Month])} on rows
from [Sales]
where ([Customer].[Gender].[M])
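To make the semantics of this query concrete, the sketch below recomputes the same kind of figure in plain Python over a tiny made-up fact table. The rows and the `accumulated_sales` helper are illustrative assumptions, not FoodMart data or Mondrian code. The point is only that `YTD()` yields a running total per month and that the `where` clause acts as a filter on the facts.

```python
from itertools import accumulate

# Made-up fact rows: (year, month, gender, store_sales).
FACTS = [
    (1997, 1, "M", 10.0),
    (1997, 1, "F", 4.0),
    (1997, 2, "M", 5.0),
    (1997, 3, "M", 2.5),
    (1997, 3, "F", 1.0),
    (1998, 1, "M", 9.0),
]

def accumulated_sales(facts, year, gender):
    """Year-to-date running total of sales per month, for one gender."""
    monthly = {}
    for row_year, month, row_gender, sales in facts:
        # The slicer: keep only the requested year and gender.
        if row_year == year and row_gender == gender:
            monthly[month] = monthly.get(month, 0.0) + sales
    months = sorted(monthly)
    running = accumulate(monthly[m] for m in months)
    return dict(zip(months, running))

result = accumulated_sales(FACTS, 1997, "M")
print(result)
```

Months without any matching fact are simply absent from this sketch's result; an OLAP engine would normally still list them as members of the axis.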

Page 32: Olap scalability

how does that work?

Page 33: Olap scalability

Analytics data modeling

● A denormalized model for performance
    the data is modeled for read operations, not writes
● High redundancy
    because sometimes more is better

Page 34: Olap scalability

The Star model

Page 35: Olap scalability

The Snowflake model

Page 36: Olap scalability

Different OLAP servers. Different beasts.

Page 37: Olap scalability

Relational OLAP (ROLAP)

● Backed by a relational database
    think of an MDX to SQL bridge.
    the aggregated data can be cached in-memory or on-disk.
● Relies heavily on the RDBMS performance
    figures out at runtime the proper optimizations

Page 38: Olap scalability

Multidimensional OLAP (MOLAP)

● Loads everything in RAM
● Relies on an efficient ETL platform

Page 39: Olap scalability

Other OLAP

● On-disk aggregated data files
    Think SAS. Cubes are compiled into data files on disk.
● Simple Bridges
    Converts MDX straight to SQL, with limited support of MDX syntax.

Page 40: Olap scalability

how do they compare?

Page 41: Olap scalability

(there are no straight answers, sorry)

Page 42: Olap scalability

Where the data lives matters

Operation                               Latency (ns)
L1 Cache Reference                               0.5
Branch Mispredict                                  5
L2 Cache Reference                                 7
Mutex lock/unlock                                 25
Main memory reference                            100
Compress 1K bytes w/ cheap algorithm           3 000
Send 2K bytes over 1 Gbps network             20 000
Read 1 MB sequentially from memory           250 000
Round trip within same datacenter            500 000
Disk seek                                 10 000 000
Read 1 MB sequentially from disk          20 000 000
Send packet CA -> Netherlands -> CA      150 000 000
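A few ratios make the table easier to reason about. The sketch below only divides figures copied from the table; the dictionary keys and the `ratio` helper are names chosen for this illustration.

```python
# Latencies in nanoseconds, copied from the table above.
LATENCY_NS = {
    "main_memory_reference": 100,
    "read_1mb_memory": 250_000,
    "datacenter_round_trip": 500_000,
    "disk_seek": 10_000_000,
    "read_1mb_disk": 20_000_000,
}

def ratio(slow, fast):
    """How many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]

disk_vs_memory_read = ratio("read_1mb_disk", "read_1mb_memory")
memory_refs_per_seek = ratio("disk_seek", "main_memory_reference")
roundtrips_per_disk_read = ratio("read_1mb_disk", "datacenter_round_trip")

print(disk_vs_memory_read, memory_refs_per_seek, roundtrips_per_disk_read)
```

By these numbers, reading 1 MB from disk costs as much as 80 reads of 1 MB from memory, or 40 round trips inside the datacenter. That suggests fetching data from another node's memory can be cheaper than reading it from a local disk.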

Page 43: Olap scalability

Optimizing for CPU

● Java NIO blocks
    use extremely compact chunks of 64 bits.
● Primitive types
    use "int" instead of "Integer"
● BitKeys
    because they are naturally CPU friendly
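The idea behind a bitkey can be sketched in a few lines: a set of columns becomes one integer, and set operations become single bitwise instructions. This is an illustration in Python, not Mondrian's BitKey class; the column-to-bit mapping and the `bitkey` and `is_superset` helpers are hypothetical.

```python
# Hypothetical mapping of column names to bit positions.
COLUMN_BITS = {"gender": 0, "country": 1, "age": 2, "city": 3}

def bitkey(*columns):
    """Encode a set of columns as an integer with one bit per column."""
    key = 0
    for column in columns:
        key |= 1 << COLUMN_BITS[column]
    return key

def is_superset(candidate, wanted):
    """True when every bit set in `wanted` is also set in `candidate`."""
    return (candidate & wanted) == wanted

cached = bitkey("gender", "country")
print(bin(cached))
print(is_superset(cached, bitkey("country")))
print(is_superset(cached, bitkey("city")))
```

The superset test is one AND and one comparison, whatever the number of columns that fit in the machine word.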

Page 44: Olap scalability

Optimizing for memory

● Hard limits on the heap space
    must pay attention to the total memory usage.
● Inherent limitations
    there can only be so many individual pointers on heap.

Page 45: Olap scalability

Optimizing for networking

● Payload optimization
    batching. deltas.
● Manageability
    turning nodes on & off.

Page 46: Olap scalability

Optimizing for disk

● Concurrent access
    must carefully manage disk IO.
● Inherently slooooow

Page 47: Olap scalability

how to deal with these issues?

Page 48: Olap scalability

a scalable indexing strategy

Page 49: Olap scalability

Cache indexing

● Linear performance is not good enough
    with N cached entries, a full scan takes O(N)
● The rollup combinatorial problem
    as the cache grows, reuse becomes tedious

Page 50: Olap scalability

The rollup combinatorial problem

Gender  Country  Sales
M       USA      7
M       CANADA   8
F       USA      4
F       CANADA   2

Country  Sales
USA      11
CANADA   10
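The rollup shown in these two tables can be sketched as follows. The dictionary layout and the `rollup` helper are assumptions made for the illustration; the figures are the ones from the slide.

```python
from collections import defaultdict

# The cached (Gender, Country) segment from the slide.
SEGMENT = {
    ("M", "USA"): 7,
    ("M", "CANADA"): 8,
    ("F", "USA"): 4,
    ("F", "CANADA"): 2,
}

def rollup(segment, keep_index):
    """Sum a segment over every column except the one at `keep_index`."""
    totals = defaultdict(int)
    for key, sales in segment.items():
        totals[key[keep_index]] += sales
    return dict(totals)

by_country = rollup(SEGMENT, keep_index=1)
print(by_country)
```

The rollup is only valid because the cached segment covers every gender. Checking that kind of completeness across many cached segments is where the combinatorial problem comes from.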

Page 51: Olap scalability

The rollup combinatorial problem

Country  Sales
?        ?
?        ?

Gender  Country  Sales
M       USA      7
M       CANADA   8

Gender  Country  Sales
F       USA      5

Age      Country  Sales
16 - 25  USA      2
26 - 40  CANADA   3
26 - 40  USA      5

Age      Country  Cost
41 - 56  USA      5

City       Sales
Montreal   6
Quebec     1
Ottawa     8
Vancouver  2
Toronto    5

Page 52: Olap scalability

PoSet & BitKeys

● Represent the levels / values as bitkeys
    because bitkeys are fast, remember?
● The PartiallyOrderedSet
    a hierarchical hash set where elements might or might not be related to one another.

Page 53: Olap scalability

PoSet & BitKeys

● An example application
    finding all primes in a set of integers
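One way to read the primes example: order the integers by divisibility, so that a number's ancestors are its divisors, and the primes are the elements with no ancestor. The sketch below assumes the set is the contiguous range starting at 2 (for an arbitrary set, an element without ancestors need not be prime). It uses a brute-force scan in place of the indexed structure, so it shows the ordering idea only. `minimal_elements` and `divides` are hypothetical names.

```python
def minimal_elements(items, is_ancestor):
    """Return the items that have no ancestor among the other items."""
    return [
        item
        for item in items
        if not any(is_ancestor(other, item) for other in items if other != item)
    ]

def divides(a, b):
    """Partial order used here: a is an ancestor of b when a divides b."""
    return b % a == 0

numbers = list(range(2, 30))
primes = minimal_elements(numbers, divides)
print(primes)
```

Two numbers such as 4 and 9 are unrelated under this ordering, which is what makes the set partially rather than totally ordered. Under a superset ordering, bitkeys can be related or unrelated in the same way.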

Page 54: Olap scalability

a scalable threading model

Page 55: Olap scalability

Concurrent cache access

● Usage of phases
    peek -> load -> rinse & repeat
● A scalable threading model
    thread safety without locks and blocks

Page 56: Olap scalability

A scalable threading model

● Do things once. Do them right.
    the actor pattern
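A minimal sketch of the actor pattern, assuming a single thread that owns the cache and a mailbox of commands. `CacheActor` and its methods are hypothetical names for this illustration, not Mondrian classes. The mailbox is the only synchronized piece; the cache itself is never touched by two threads.

```python
import queue
import threading
from concurrent.futures import Future

class CacheActor:
    """Owns the cache; all reads and writes run on its single thread."""

    def __init__(self):
        self._cache = {}
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Messages are handled strictly one at a time, so the cache
        # itself never needs a lock.
        while True:
            message = self._mailbox.get()
            if message is None:
                break
            command, future = message
            try:
                future.set_result(command(self._cache))
            except Exception as error:
                future.set_exception(error)

    def send(self, command):
        """Queue a command and return a Future for its result."""
        future = Future()
        self._mailbox.put((command, future))
        return future

    def stop(self):
        """Ask the actor thread to exit and wait for it."""
        self._mailbox.put(None)
        self._thread.join()

actor = CacheActor()
actor.send(lambda cache: cache.update({"USA": 11}))
answer = actor.send(lambda cache: cache.get("USA")).result(timeout=5)
actor.stop()
print(answer)
```

Callers from any thread get a Future back immediately, and the mailbox preserves the order in which commands were sent.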

Page 57: Olap scalability

a scalable cache management strategy

Page 58: Olap scalability

Operating by deltas

● All part of a whole
    implicit relation between the dimensions
● Why deltas are necessary
    reducing IO

Page 59: Olap scalability

Cache management

● A data block is a complex object

Schema:[FoodMart]
Checksum:[9cca66327439577753dd5c3144ab59b5]
Cube:[Sales]
Measure:[Unit Sales]
Axes:[
    {time_by_day.the_year=(*)}
    {time_by_day.quarter=('Q1', 'Q2')}
    {product_class.product_family=('Bread', 'Soft Drinks')}]
Excluded Regions:[
    {time_by_day.quarter=('Q1')}
    {time_by_day.the_year=('1997')}]
Compound Predicates:[]
ID:[9c8ba4ec39678526f4100506994c384183cd205d19dd142eae76a9fb1d74cab7]
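To show why such a block needs more than a flat hash as its key, here is a simplified sketch of a rich key that can answer "does this region touch me?". `SegmentHeader`, `ANY` and `overlaps` are hypothetical names; the sketch leaves out excluded regions, compound predicates and checksums, and is not Mondrian's implementation.

```python
from dataclasses import dataclass

ANY = None  # Wildcard, standing in for the (*) predicate above.

@dataclass(frozen=True)
class SegmentHeader:
    """A rich cache key: which cube, which measure, which axis values."""
    cube: str
    measure: str
    axes: tuple  # ((column, frozenset of values or ANY), ...)

    def overlaps(self, region):
        """True when `region` ({column: set of values}) touches this segment."""
        constraints = dict(self.axes)
        for column, values in region.items():
            held = constraints.get(column, ANY)
            if held is not ANY and not (held & values):
                return False
        return True

header = SegmentHeader(
    cube="Sales",
    measure="Unit Sales",
    axes=(
        ("time_by_day.the_year", ANY),
        ("time_by_day.quarter", frozenset({"Q1", "Q2"})),
        ("product_class.product_family", frozenset({"Bread", "Soft Drinks"})),
    ),
)

print(header.overlaps({"time_by_day.quarter": {"Q2"}}))
print(header.overlaps({"time_by_day.quarter": {"Q3"}}))
```

The header is still hashable, so it can serve as a dictionary key, but the overlap question can only be answered by looking inside the key, which a plain hash does not allow.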

Page 60: Olap scalability

a scalable sharing strategy

Page 61: Olap scalability

Shared Caches

● OLAP and key-value stores don't like each other
    OLAP requires a complex key. a hash is insufficient.
● Remember the "deltas" strategy?
    partially invalidating a block of data would break the hash

Page 62: Olap scalability

Data grids & OLAP

● Well suited for OLAP caches
    supports "rich" keys
● Distributed and redundant
    if a node goes offline, the cache data is not lost
● In-memory grids are fast
    multiplies the available heap space

Page 63: Olap scalability

a case study

Page 64: Olap scalability

Advertising data analysis

Interactive behavioral targeting of advertising in real time

Page 65: Olap scalability

Advertising data analysis

● Low latency
    the end users don't want to wait for MapReduce jobs
● Scalability is a huge factor
    we're talking petabytes of data here

Page 66: Olap scalability

Advertising data analysis

● Queries are not static
    we can't tell upfront what will be computed
● Deployed in datacenters worldwide
    the hashing strategy must allow "smart" data distribution
● Almost all open source

Page 67: Olap scalability

[Architecture diagram. Components: Client App, olap4j, XML/A, Load Balancer, OLAP, OLAP Cache, Analytical DB, Message Queue, Monitoring & Management, ETL Designer, Big Data Store, and four ETL / Logs pairs]

Page 68: Olap scalability

[Architecture diagram. Components: Client App, olap4j, XML/A, Load Balancer, OLAP, OLAP Cache, Analytical DB]

● A query
    - UI sends MDX to a SOAP service.
    - load balancer dispatches the query.
    - OLAP layer uses its data sources and aggregates.
    - query is answered.

Page 69: Olap scalability

[Architecture diagram. Components: OLAP, OLAP Cache, Analytical DB, Message Queue, Big Data Store, and four ETL / Logs pairs]

● An update - Strategy #1
    - the ETL process updates the analytical DB.
    - a cache delta is sent to a message queue.
    - OLAP processes the message.
    - OLAP uses its index to spot the regions to invalidate.
    - aggregated cache is updated incrementally.
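The steps of strategy #1 can be sketched with an in-process queue standing in for the message queue. Everything here is illustrative: the segment ids, the message shape and `process_delta` are assumptions, and the sketch patches the affected cells in place rather than showing a real invalidation protocol.

```python
import queue
from collections import defaultdict

# Aggregated cache: segment id -> {country: sales}. Values are made up.
cache = {
    "seg-usa": {"USA": 11},
    "seg-canada": {"CANADA": 10},
}

# Index: member -> ids of the cached segments that contain it.
index = defaultdict(set)
for segment_id, cells in cache.items():
    for member in cells:
        index[member].add(segment_id)

def process_delta(delta):
    """Apply one delta message to every segment the index points at."""
    touched = index.get(delta["member"], set())
    for segment_id in touched:
        cache[segment_id][delta["member"]] += delta["amount"]
    return sorted(touched)

# The ETL side publishes a delta; the OLAP side consumes it.
message_queue = queue.Queue()
message_queue.put({"member": "USA", "amount": 3})

touched_segments = process_delta(message_queue.get())
print(touched_segments, cache)
```

Only the segments found through the index are touched; the rest of the cache stays warm, which is the IO saving that deltas are meant to provide.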

Page 70: Olap scalability

[Architecture diagram. Components: OLAP, OLAP Cache, Analytical DB, Big Data Store, and four ETL / Logs pairs]

● An update - Strategy #2
    - ETL updates the analytical DB.
    - ETL acts directly on the OLAP cache.
    - OLAP processes events from its cache.
    - OLAP updates its index.

Page 71: Olap scalability

a stack built on open standards

(get ready, the next slide will hurt your brains)

Page 72: Olap scalability

[Stack diagram. Labels recovered from the diagram: Client App (olap4j-xmla, Java), load balancer, three olap4j servers, Mondrian server manager (olap4j impl) x3, Mondrian cache manager x3, JDBC connection pools, infinispan data grid (three infinispan nodes). Protocol labels: HTTP (XMLA), UDP (Hot Rod), Java, JDBC]

Page 73: Olap scalability

the UI

Page 74: Olap scalability

Yahoo! Cocktails

● A Node.js implementation
    runs on ManhattanJS hosted execution
● Mojito
    client application framework
● Works both online / offline

[Diagram: Client App]

Page 75: Olap scalability

the OLAP service

Page 76: Olap scalability

olap4j-xmla / olap4j-server

● JDBC for OLAP
    extension to JDBC. became the de facto standard.
● A Java toolkit for OLAP
    - MDX parser / validator
    - a rich type system / MDX object model
    - driver specification
    - programmatic query models
    - olap4j to XMLA bridge

[Diagram: Client App, olap4j, XML/A, Load Balancer, olap4j]

Page 77: Olap scalability

the OLAP layer

Page 78: Olap scalability

Mondrian

● Developed by Pentaho Corp.
    used worldwide. pure java. open source.
● Highly extensible
    exposes many APIs & SPIs for enterprise integration.
● ROLAP / MOLAP hybrid
    uses the best of what's available.
● Extensible MDX parser
    new MDX functions can be created for specific business domains.

[Diagram: OLAP]

Page 79: Olap scalability

the OLAP cache

Page 80: Olap scalability

Stuff that didn't work

● memcached
    ○ doesn't have an index.
    ○ enforces random TTLs.
    ○ a hash key is not enough
● simple Java collections

[Diagram: OLAP Cache]

Page 81: Olap scalability

Infinispan

● Developed for JBoss AS
    well tested.
● UDP Multicast
    nodes can join and leave the cluster as needed.
● Can distribute the processing
    jobs can be distributed and run on the nodes.
● Serializes rich objects
    the contents can be read from APIs.

[Diagram: OLAP Cache]

Page 82: Olap scalability

the analytical DB layer

Page 83: Olap scalability

Oracle

● Cluster of instances
    partitioned Oracle nodes
● Why Oracle?
    because their DBAs are good enough with Oracle to get it to run properly under such a load

[Diagram: Analytical DB]

Page 84: Olap scalability

Other options

● An analytical oriented DB
    use of Vectorwise, Vertica, MonetDB, Greenplum, ...
● Column stores
    Column stores scale marvelously and are well suited for analytics

[Diagram: Analytical DB]

Page 85: Olap scalability

the Big Data layer

Page 86: Olap scalability

Big Data Layer

● Homebrew Java MapReduce
● 42 000 nodes
● ETL processes managed with Pig
● A keynote in itself
    (see the resources at the end for a keynote from Scott Burke, Senior VP of Yahoo!)

[Diagram: Big Data Store and four ETL / Logs pairs]

Page 87: Olap scalability

some numbers

Page 88: Olap scalability

Final processing capacity

● Big Data layer
    ○ 140 petabytes
    ○ 500 users
    ○ 42 000 nodes
    ○ 10 000 000 hours of CPU time usage per day
    ○ 100 000 000 000 records per day
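Some quick arithmetic puts these figures in per-node and per-second terms. The sketch only divides the numbers quoted on the slide; the variable names are chosen for this illustration, and it assumes the load is spread evenly over nodes and over the day.

```python
# Figures quoted on the slide for the Big Data layer.
NODES = 42_000
CPU_HOURS_PER_DAY = 10_000_000
RECORDS_PER_DAY = 100_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60

cpu_hours_per_node = CPU_HOURS_PER_DAY / NODES
busy_cores_per_node = cpu_hours_per_node / 24
records_per_second = RECORDS_PER_DAY / SECONDS_PER_DAY

print(round(cpu_hours_per_node), round(busy_cores_per_node, 1), round(records_per_second))
```

Under that even-spread assumption, each node burns about 238 CPU-hours per day, which is roughly 10 cores kept busy around the clock, and the cluster ingests more than a million records per second.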

Page 89: Olap scalability

Final processing capacity

● Analytical DB layer
    ○ 50 terabytes
    ○ 100s of tables (heavy use of the snowflake schema)
    ○ 1 000 000 000 new rows per day

Page 90: Olap scalability

Final processing capacity

● OLAP layer
    ○ 10s of Mondrian instances
    ○ 10s of cubes
    ○ 100s of dimensions
    ○ 1 000s of levels
    ○ 1 000 000s of members per level
    ○ 1 000 000 000s of facts per day

Page 91: Olap scalability

skunkworks

(future stuff you might care about)

Page 92: Olap scalability

Mondrian over Google's BigQuery

● Big Data as a service
    upload CSVs & other formats to an ad-hoc cluster
● No code required
    MapReduce jobs usually require you to code them

Page 93: Olap scalability

Pentaho Instaview

● Interactive data discovery for Big Data
    fully integrated ETL / OLAP. all you need is a URL and a user / password.
● A rich UI environment for data
    drag & drop. full OLAP support. mobile.
● Open source

Page 94: Olap scalability

resources

Mondrian - The open source analytics engine
    mondrian.pentaho.org
olap4j - The open standard for OLAP in Java
    olap4j.org
Infinispan - The distributed data grid platform
    jboss.org/infinispan
Scott Burke, SVP Advertising & Data @ Yahoo!, Keynote of Hadoop Summit 2012
    youtube.com/watch?v=mR30psmuIPo

Page 95: Olap scalability

resources

Pentaho Instaview
    pentahobigdata.com/ecosystem/capabilities/instaview

Page 96: Olap scalability

big thanks

On Twitter: @luclemagnifique

On the blogosphere: devdonkey.blogspot.ca