MAD Skills: New Analysis Practices for Big Data

Slides courtesy of the original paper and Christan Grant’s slides

Presented by Long Pham

11/11/2014

• Authors

• Jeff Cohen – Greenplum

• Brian Dolan – Fox Audience Network

• Mark Dunlap – Evergreen Technologies

• Joseph M. Hellerstein – UC Berkeley

• Caleb Welton – Greenplum

• Presented at the Very Large Data Bases (VLDB) conference, 2009, in Lyon, France

What not to expect…

• Smart system supports novice users

• Diagrams + Pictures

• Quantitative experiments

• Formal proof

What to expect…

• Smart system supports smart users

• Analysis

• Implementation examples

• Real application scenarios

• Reflection & Future discussion

• If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap.

• So what’s getting ubiquitous and cheap? Data.

• And what is complementary to data? Analysis.

• – Prof. Hal Varian, Chief Economist at Google

Traditionally, data analytics (OLAP) = well-structured data warehouse

• Single expensive center dedicated to analytics (separate from OLTP)

• Pre-materialization for pre-defined tasks

• Jealously guarded by engineers

• To ensure high quality integration

Things are changing towards decentralized analytics centers

• Cheap storage

• World's largest data warehouse 10 years ago ~ $100 nowadays

• Massive data

• Even from a single source like clicks

• Popular analytics

• Proved to be profitable

New paradigm is needed: Magnetic Agile Deep

• Magnetic: attracts all data sources regardless of quality

• vs. a single center

• Agile: continuously adaptive structure

• vs. a rigid well-structured architecture

• Deep: supports sophisticated algorithms

• vs. being limited to roll-up, drill-down, etc.

Magnetic

Agenda

• MAD vs. traditions

• Scenario: Fox Audience Network (FAN)

• MAD design

• MAD algebra implementation

• Variable types

• Corresponding operators

• MAD system implementation

• Reflections & Directions

MAD is deeper than Data Cubes: inferential vs. descriptive

• Descriptive Data Cubes:

• Roll-up, Drill-down, etc.

• To gain understanding

• Inferential MAD:

• Fit with models: e.g., Gaussian distribution

• Deeper understanding:

• Robust to outliers

• Robust with specific datasets

• Enable advanced tasks:

• Prediction

• Causality analysis

• Distributional comparison

MAD is closer to data than Stats software

• Stats software examples: Matlab, R, etc.

• Runs directly in the database vs. loading data into the software

• Distributed vs. in-memory data

Agile, Magnetic

MAD is a more extensible eco-system than current MapReduce

• Current MapReduce: complicated algorithms are black-boxes

• MAD advocates a more extensible and modifiable eco-system

• Currently: SQL-based

• Possibly: MapReduce-based

Deep, Agile, Magnetic

Fox Audience Network

• Served MySpace.com, IGN.com, Scout.com etc.

• About 150 Million users

• Ad network, bought by Rubicon in Nov 2010

It is big!

• 42 Nodes (2 Masters, 40 Workers) Sun X4500

• Thumper

• 48 500GB drives

• 16GB RAM

• ~ 5TB Daily

• One table with 1.5 trillion rows

• Many different types of user workloads

• Dynamic query ecosystems

Magnetic

It has various query types

• How many female WWF enthusiasts under the age of 30 visited the Toyota community over the last four days and saw a medium rectangle?

• Ad hoc + Expensive -> fast?

• How are these people similar to those that visited Nissan?

• Open-ended, requires some statistics and the analyst to be in the loop.

MAD Design Requirements

• Loading shouldn’t take too long

• Integration and cleaning routines are unaffordable

• Analysts tolerate noise, in exchange for

• Being the first to analyze data

• Requiring sophisticated analysis

• Besides, it is advised to have a single data center

• Not decentralized physically but logically

• Thus:

• Data warehouse can be improved gradually

• Analysts must be armed with necessary tools

Agile, Deep, Magnetic

MAD philosophy: 3 logical layers to allow analysts to touch data as soon as possible

• Reporting – Specialized static aggregates

• For novice analysts

• Specialized

• Tuned for performance

• Production – Aggregates used by most users

• For more advanced analysts

• Armed with common aggregations

• Stages – Raw tables and logs

• For engineers & some analysts

• Besides, Sandboxes – Playground for analysts

Challenge: Provide powerful tools for analysts to agilely keep up with magnetic data & go deep!

MADlib = “RDBMS” + Stats + Math + Machine learning

• Built on an RDBMS (PostgreSQL)

• One single type: Scalar (value)

• Advance to more Math-friendly Stats-friendly types and corresponding operations:

Scalar: 0.1, 0.2…

Vector: [0.1, 0.2] [0.2, 0.4]

Matrix: [[0.1, 0.2], [0.2, 0.4]]

Function: probability density function f(.)

Functional: Mann-Whitney U test comparing distributions f(.) and g(.)

• Enable Stats-friendly and ML-friendly operations:

Resampling

• Through User Defined functions (UDF)

Magnetic, Deep, Agile

Scalar operations have been implemented in RDBMS

• SELECT 5*4;

• SELECT sqrt(64);

• SELECT cos(-3.14159 * sqrt(2) / 2 );

Vectors/Matrices can be considered as relation objects in an Object-Relational Database

• Matrix = (row_number integer, vector numeric[])

• Postgres has the extension!

• Summation, product, dot product are trivial
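The row-wise layout above can be sketched outside the database. This is a minimal Python illustration, not the SQL from the paper; the names `matrix`, `dot`, and `mat_vec` are made up for the example.

```python
# Each matrix row is stored as one tuple (row_number, vector),
# mirroring the relation Matrix = (row_number integer, vector numeric[]).
matrix = [
    (1, [0.1, 0.2]),
    (2, [0.2, 0.4]),
]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def mat_vec(rows, v):
    """Matrix-vector product: one dot product per stored row."""
    return [(row_number, dot(vector, v)) for row_number, vector in rows]


result = mat_vec(matrix, [1.0, 1.0])
```

Because every row carries its whole vector, a matrix-vector product needs no join: it is a single pass over the rows, which is why these operations are called trivial in this layout.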

Matrices may have other data layouts to facilitate a particular operator like transpose

• If using previous representation

• If using sparse representation:

• (row number, column number, value)

• Trivial transpose

• Fast multiplication
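A minimal Python sketch of the sparse layout, assuming one (row, column, value) triple per non-zero entry. The function names are illustrative; the multiplication mimics a join on the shared index followed by a grouped sum, which is how it would naturally be phrased in SQL.

```python
from collections import defaultdict

# Sparse layout: one (row, column, value) triple per non-zero entry.
A = [(1, 1, 0.1), (1, 2, 0.2), (2, 1, 0.2), (2, 2, 0.4)]


def transpose(triples):
    """Transpose by swapping the row and column fields of every triple."""
    return [(c, r, v) for r, c, v in triples]


def multiply(left, right):
    """Sparse product: join on left.column = right.row,
    then sum the products grouped by (left.row, right.column)."""
    right_by_row = defaultdict(list)
    for r, c, v in right:
        right_by_row[r].append((c, v))
    sums = defaultdict(float)
    for r, k, v in left:
        for c, w in right_by_row[k]:
            sums[(r, c)] += v * w
    return sorted((r, c, v) for (r, c), v in sums.items())


At = transpose(A)
AtA = multiply(At, A)
```

Transpose touches no values at all, only field order, and the product only visits pairs of non-zero entries that share an index.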

Application Example: cosine similarity for fraud detection

• Scenario: Detect similar docs (measured by cosine similarity) promoted by different advertisers:

• They are usually fraudulent

• The advertisers usually use stolen credit cards

• Using matrix operators, the implementation is natural:

Not black box
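The slide's implementation was an image and is not recoverable here. As a stand-in, a minimal Python sketch of the idea: cosine similarity is a dot product divided by two norms, and a pair of documents is flagged when it is nearly identical but promoted by different advertisers. The document vectors, advertiser names, and the 0.9 threshold are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical term-count vectors, keyed by (advertiser, document).
docs = {
    ("advertiser_1", "doc_a"): [2, 1, 0, 1],
    ("advertiser_2", "doc_b"): [4, 2, 0, 2],
    ("advertiser_2", "doc_c"): [0, 0, 3, 0],
}


def suspicious_pairs(docs, threshold=0.9):
    """Pairs of near-identical documents promoted by different advertisers."""
    pairs = []
    for (key_a, vec_a), (key_b, vec_b) in combinations(sorted(docs.items()), 2):
        different_advertisers = key_a[0] != key_b[0]
        if different_advertisers and cosine_similarity(vec_a, vec_b) >= threshold:
            pairs.append((key_a[1], key_b[1]))
    return pairs


flagged = suspicious_pairs(docs)
```

Every step is an ordinary vector operation, so an analyst can inspect or change any part of it, which is the "not black box" point.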

Other examples in the paper

• Ordinary Least Squares

• Using pseudo inverse matrix routine

• Found in Math textbook

• Also applied for matrix division

• Conjugate Gradient

• iterative

• efficient?

• Support Vector Machine
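To make the least-squares item concrete, here is a minimal pure-Python sketch. It solves the normal equations (XᵀX)β = Xᵀy, which gives the same answer as applying the pseudo-inverse when X has full column rank. This is an assumption-laden illustration, not the paper's SQL routine, and all function names are made up.

```python
def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a x = b for square a by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients from the normal equations (X^T X) beta = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(x * yi for x, yi in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Toy data lying exactly on y = 1 + 2x; the first column is the intercept term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta = ols(X, y)
```

Note that XᵀX and Xᵀy are sums over rows, so in a database they can be computed as aggregates in one pass; only the small final solve happens outside the scan.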

Using the existing operators, analysts can solve a number of complicated problems.

Before that, they used to load data into R, which is

Function: UDF ~ trivial

• Correct me if I am wrong!

Functional example: Mann-Whitney U Test (MWU)

• Scenarios:

• Web companies compare user experiences from different versions of their website to find the best.

• Ad companies compare different ad campaigns to find the one with the highest click-through rate

• For non-parametric data, i.e., a data set that does not fit a well-known distribution

• Calculation involves some counts
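The "counts" can be sketched as follows: the U statistic for one sample is the number of cross-sample pairs it wins, with ties counted as half. This minimal Python example computes only the statistic, not a p-value, and the sample values are invented for illustration.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b): for each sample, the count of cross-sample pairs
    in which its value is larger, with ties counted as one half."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Hypothetical measurements from two versions of a page.
version_a = [1.5, 2.5, 3.5]
version_b = [1.0, 2.0, 3.0, 4.0]
u_a, u_b = mann_whitney_u(version_a, version_b)
```

The pairwise loop is quadratic; the usual alternative ranks the pooled values once and sums the ranks per sample, which fits a sort-and-aggregate query.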

MWU implementation

• No black box

• Direct computation in the database

• Easy-to-use interface

Other example: Log-likelihood ratio

• Binomial distribution

• Multinomial distribution

• Questions/Comments?

Resampling Implementation

Create 10000 trials, each has size 3, as a view

Specify experiment (e.g., avg each subsample) by view

Run experiment by a single query
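The three steps above can be sketched in Python. This assumes sampling with replacement (a bootstrap) and uses the slide's numbers, 10000 trials of size 3; the data values, the fixed seed, and the function names are assumptions for illustration, standing in for the SQL views.

```python
import random


def make_trials(data, trials=10000, size=3, seed=0):
    """Step 1: draw `trials` subsamples of `size` items, with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.choices(data, k=size) for _ in range(trials)]


def experiment(subsample):
    """Step 2: the statistic computed per trial, here the average."""
    return sum(subsample) / len(subsample)


def run(data):
    """Step 3: apply the experiment to every trial in one pass."""
    return [experiment(trial) for trial in make_trials(data)]


# Hypothetical observations.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
means = run(data)
estimate = sum(means) / len(means)
```

The spread of `means` approximates the sampling distribution of the average, which is what resampling is for when no closed-form distribution is available.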

• MAD system implementation (MAD RDBMS)

MAD RDBMS

• Magnetic

• Get data painlessly

• Agility

• Efficient & Adaptive physical storage

• Deep

• Flexible programming eco-system

MAD Loading/Uploading

• Scatter/Gather Streaming

• Shared-nothing

• Coordination with external data:

• Data are queried while streaming

• Fast

• 4 TB/hour with minimal impact on current DB operations

• Greenplum has MapReduce support!

Agile, Magnetic

MAD Storage

• Tunable table types for different stages:

• external tables (e.g. files)

• heap tables (frequent updates)

• append-only tables (rare updates)

• column-stores flexibility

• Users can specify distribution policy

Agile, Magnetic

MAD Partitioning

• Partition by range of values or columns (list)

• e.g., partition by timestamp: old data goes to compressed tables, new data goes to heap storage

• Query optimizer knows the partitioning scheme

• Users can delay using partitions until partitioning is complete

Agile, Magnetic

MAD Programming

• Flexible in coding: extensible library

• Flexible in programming metaphors: MapReduce vs. SQL

• Programmers must think through how the code works without shared memory (data-parallel)?

Directions

• Package management and reuse

• Co-optimizing storage and queries for linear algebra

• Automating physical design for iterative tasks

• Online query processing for MAD analytics

“The question is not whether to get MAD, but how and when”

– Authors

Questions

• Do Spark/Shark/BlinkDB provide a better “how”?

• It is unclear how they handle parallel processing

• Is that implied when using SQL and a shared-nothing architecture?

Thank you!
