MAD Skills: New Analysis Practices for Big Data
Slides courtesy of the original paper and Christan Grant’s slides. Presented by Long Pham
11/11/2014
MAD Skills: New Analysis Practices for Big Data
• Authors
• Jeff Cohen – Greenplum
• Brian Dolan – Fox Audience Network
• Mark Dunlap – Evergreen Technologies
• Joseph M Hellerstein – UC Berkeley
• Caleb Welton – Greenplum
• Presented at the Very Large Data Bases (VLDB) Conference 2009 in Lyon, France
What not to expect…
• A smart system that supports novice users
• Diagrams + pictures
• Quantitative experiments
• Formal proofs
What to expect…
• A smart system that supports smart users
• Analysis
• Implementation examples
• Real application scenarios
• Reflection & Future discussion
• If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap.
• So what’s getting ubiquitous and cheap? Data.
• And what is complementary to data? Analysis.
– Prof. Hal Varian, Chief Economist at Google
Traditionally, data analytics (OLAP) = well-structured data warehouse
• A single, expensive center dedicated to analytics (separate from OLTP)
• Pre-materialization for pre-defined tasks
• Jealously guarded by engineers
• To ensure high-quality integration
Things are changing towards decentralized analytics centers
• Cheap storage
• The world’s largest data warehouse of 10 years ago fits on commodity disks costing ~ $100 each nowadays
• Massive data
• Even from a single source such as clicks
• Popular analytics
• Proven to be profitable
A new paradigm is needed: Magnetic, Agile, Deep
• Magnetic: attracts all data sources regardless of quality
• vs. a single center
• Agile: continuously adaptive structure
• vs. a rigid, well-structured architecture
• Deep: supports sophisticated algorithms
• vs. being limited to roll-up, drill-down, etc.
Agenda
• MAD vs. traditions
• Scenario: Fox Audience Network (FAN)
• MAD design
• MAD algebra implementation
• Variable types
• Corresponding operators
• MAD system implementation
• Reflections & Directions
MAD is deeper than Data Cubes: inferential vs. descriptive
• Descriptive Data Cubes:
• Roll-up, drill-down, etc.
• To gain understanding
• Inferential MAD:
• Fit data to models, e.g., a Gaussian distribution
• Deeper understanding:
• Robust to outliers
• Robust with specific datasets
• Enables advanced tasks:
• Prediction
• Causality analysis
• Distributional comparison
MAD is closer to the data than stats software
• Stats software examples: Matlab, R, etc.
• Runs directly in the database vs. loading data into the software
• Distributed vs. in-memory data
MAD is a more extensible ecosystem than current MapReduce
• Current MapReduce: complicated algorithms are black boxes
• MAD advocates a more extensible and modifiable ecosystem
• Currently: SQL-based
• Possibly: MapReduce-based
Fox Audience Network
• Served MySpace.com, IGN.com, Scout.com, etc.
• About 150 million users
• Ad network, bought by Rubicon in Nov 2010
It is big!
• 42 nodes (2 masters, 40 workers), Sun X4500 “Thumper”
• 48 × 500 GB drives
• 16 GB RAM
• ~ 5 TB daily
• One table with 1.5 trillion rows
• Many different types of user workloads
• Dynamic query ecosystems
It has various query types
• How many female WWF enthusiasts under the age of 30 visited the Toyota community over the last four days and saw a medium rectangle?
• Ad hoc + expensive -> can it be fast?
• How are these people similar to those that visited Nissan?
• Open-ended; requires some statistics and the analyst to be in the loop.
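To make the first, ad hoc question concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `users` and `impressions` tables, their columns, and the sample rows are hypothetical and chosen only for illustration; they are not FAN's actual schema, and SQLite stands in for the parallel database.

```python
import sqlite3

# Hypothetical schema and data, for illustration only; not FAN's actual tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, gender TEXT, age INTEGER, interest TEXT);
    CREATE TABLE impressions (user_id INTEGER, community TEXT,
                              ad_format TEXT, days_ago INTEGER);
    INSERT INTO users VALUES
        (1, 'F', 25, 'WWF'), (2, 'F', 35, 'WWF'),
        (3, 'M', 22, 'WWF'), (4, 'F', 28, 'WWF');
    INSERT INTO impressions VALUES
        (1, 'Toyota', 'medium rectangle', 1),
        (1, 'Toyota', 'medium rectangle', 2),
        (2, 'Toyota', 'medium rectangle', 0),
        (3, 'Toyota', 'medium rectangle', 3),
        (4, 'Toyota', 'medium rectangle', 10);
""")

# Count each matching user once, even if she saw the ad several times.
(matching_users,) = conn.execute("""
    SELECT COUNT(DISTINCT u.user_id)
    FROM users u JOIN impressions i ON i.user_id = u.user_id
    WHERE u.gender = 'F' AND u.age < 30 AND u.interest = 'WWF'
      AND i.community = 'Toyota'
      AND i.ad_format = 'medium rectangle'
      AND i.days_ago < 4
""").fetchone()
print(matching_users)
```

The second question (similarity to Nissan visitors) has no such one-line answer, which is why the analyst and some statistics have to stay in the loop.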
MAD Design Requirements
• Loading shouldn’t take too long
• Integration and cleaning routines are unaffordable
• Analysts tolerate noise in exchange for
• Being the first to analyze data
• Requiring sophisticated analysis
• Besides, a single data center is still advised
• Decentralized logically, not physically
• Thus:
• The data warehouse can be improved gradually
• Analysts must be armed with the necessary tools
MAD philosophy: 3 logical layers to allow analysts to touch data as soon as possible
• Reporting – Specialized static aggregates
• For novice analysts
• Specialized
• Tuned for performance
• Production – Aggregates used by most users
• For more advanced analysts
• Armed with common aggregations
• Staging – Raw tables and logs
• For engineers & some analysts
• Besides, Sandboxes – Playground for analysts
Challenge: Provide powerful tools for analysts to agilely keep up with magnetic data & go deep!
MADlib = “RDBMS” + Stats + Math + Machine learning
• Built on an RDBMS (PostgreSQL)
• One single type: Scalar (value)
• Advance to more math-friendly, stats-friendly types and corresponding operations:
• Scalar: 0.1, 0.2…
• Vector: [0.1, 0.2] [0.2, 0.4]
• Matrix: [[0.1, 0.2], [0.2, 0.4]]
• Function: probability density function f(.)
• Functional: e.g., the Mann-Whitney U test over distributions f(.) and g(.)
• Enable stats-friendly and ML-friendly operations:
• Resampling
• Through user-defined functions (UDFs)
Scalar operations are already implemented in the RDBMS
• SELECT 5*4;
• SELECT sqrt(64);
• SELECT cos(-3.14159 * sqrt(2) / 2);
Vectors/matrices can be treated as relational objects in an object-relational database
• Matrix = (row_number integer, vector numeric[])
• Postgres has the extension!
• Summation, product, and dot product are trivial
Matrices may have other data layouts to facilitate a particular operator like transpose
• If using the previous (row_number, vector) representation: transpose is not trivial
• If using a sparse representation:
• (row number, column number, value)
• Trivial transpose
• Fast multiplication
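As a minimal sketch of why the sparse layout helps, the following uses Python's built-in sqlite3 module as a stand-in for the parallel database. The table names `a` and `b`, the column names, and the sample values are assumptions made for illustration; only the (row, column, value) layout comes from the slide.

```python
import sqlite3

# Hypothetical tables a and b in the sparse (row, column, value) layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (row_num INTEGER, col_num INTEGER, val REAL);
    CREATE TABLE b (row_num INTEGER, col_num INTEGER, val REAL);
    -- A = [[1, 2], [0, 3]]; zero entries are simply not stored
    INSERT INTO a VALUES (1, 1, 1.0), (1, 2, 2.0), (2, 2, 3.0);
    -- B = [[4, 0], [5, 6]]
    INSERT INTO b VALUES (1, 1, 4.0), (2, 1, 5.0), (2, 2, 6.0);
""")

# Transpose: just swap the two index columns.
transpose = conn.execute("""
    SELECT col_num AS i, row_num AS j, val
    FROM a ORDER BY i, j
""").fetchall()

# Multiplication: join A's columns to B's rows, then sum the products per cell.
product = conn.execute("""
    SELECT a.row_num AS i, b.col_num AS j, SUM(a.val * b.val) AS val
    FROM a JOIN b ON a.col_num = b.row_num
    GROUP BY a.row_num, b.col_num
    ORDER BY i, j
""").fetchall()

print(transpose)
print(product)
```

Both operations are plain relational queries (a projection, and a join plus aggregate), so the database can plan and parallelize them like any other query.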
Application Example: cosine similarity for fraud detection
• Scenario: detect similar docs (measured by cosine similarity) promoted by different advertisers:
• They are usually fraudulent
• The advertisers usually use stolen credit cards
• Using matrix operators, the implementation is natural
• Not a black box
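The original code for this slide is not reproduced in the transcript. As a rough sketch of the idea, assuming documents are stored as sparse term vectors in a hypothetical `doc_term` table, cosine similarity reduces to a self-join for the dot products plus an aggregate for the norms. SQLite (via Python's sqlite3) stands in for the real system, and `sqrt` is registered from Python because not every SQLite build ships it.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("sqrt", 1, math.sqrt)  # not all SQLite builds include sqrt
conn.executescript("""
    -- Hypothetical table: one row per (document, term) with a weight
    CREATE TABLE doc_term (doc_id INTEGER, term_id INTEGER, weight REAL);
    INSERT INTO doc_term VALUES
        (1, 1, 1.0), (1, 2, 2.0),
        (2, 1, 2.0), (2, 2, 4.0),
        (3, 3, 5.0);
""")

# Docs 1 and 2 point in the same direction; doc 3 shares no terms with them.
similar = conn.execute("""
    WITH norms AS (
        SELECT doc_id, sqrt(SUM(weight * weight)) AS norm
        FROM doc_term GROUP BY doc_id
    ),
    dots AS (
        SELECT x.doc_id AS d1, y.doc_id AS d2, SUM(x.weight * y.weight) AS dot
        FROM doc_term x JOIN doc_term y ON x.term_id = y.term_id
        WHERE x.doc_id < y.doc_id
        GROUP BY x.doc_id, y.doc_id
    )
    SELECT d1, d2, dot / (n1.norm * n2.norm) AS cosine
    FROM dots
    JOIN norms n1 ON n1.doc_id = d1
    JOIN norms n2 ON n2.doc_id = d2
    ORDER BY d1, d2
""").fetchall()

print(similar)
```

Pairs with no terms in common produce no row at all, which corresponds to a cosine of zero; a fraud check would then keep only high-similarity pairs that belong to different advertisers.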
Other examples in the paper
• Ordinary Least Squares
• Using a pseudo-inverse matrix routine
• Found in math textbooks
• Also applied for matrix division
• Conjugate Gradient
• Iterative
• Efficient?
• Support Vector Machine
Using the existing operators, analysts can solve a number of complicated problems. Before that, they used to load data into R, which is slow.
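A minimal sketch of the in-database flavor of Ordinary Least Squares, under simplifying assumptions: a single feature, a hypothetical `obs` table, and SQLite via Python's sqlite3 in place of the parallel database. The sums that make up the normal equations are computed by one aggregate query; the remaining 2×2 system is solved in closed form here rather than with a general pseudo-inverse routine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical observations lying exactly on y = 1 + 2x
    CREATE TABLE obs (x REAL, y REAL);
    INSERT INTO obs VALUES (0, 1), (1, 3), (2, 5), (3, 7);
""")

# One pass over the data yields every entry of X'X and X'y.
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM obs"
).fetchone()

# Normal equations [[n, sx], [sx, sxx]] [b0, b1]' = [sy, sxy]', via Cramer's rule.
det = n * sxx - sx * sx
intercept = (sy * sxx - sx * sxy) / det
slope = (n * sxy - sx * sy) / det

print(intercept, slope)
```

The point of the design is that only a handful of sums leave the database, no matter how many rows `obs` holds.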
Function: UDF ~ trivial
• Correct me if I am wrong!
Functional example: Mann-Whitney U Test (MWU)
• Scenarios:
• Web companies compare user experiences across different versions of their website to find the best one.
• Ad companies compare different ad campaigns to find the one with the highest click-through rate.
• MWU is non-parametric: it suits data sets that do not fit a well-known distribution
• The calculation involves some counts
MWU implementation
• No black box
• Direct computation in the database
• Easy-to-use interface
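The slide's code is not part of this transcript. As an illustrative sketch only, the U statistic can be computed with the pairwise-count definition: for every pair drawn from the two groups, count 1 when the value from group A is larger and 0.5 for a tie. The `samples` table and its data are hypothetical, SQLite via Python's sqlite3 stands in for the database, and the cross join is quadratic, so a rank-based formulation would be the likelier choice at scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical measurements for two groups, e.g. two site versions
    CREATE TABLE samples (grp TEXT, val REAL);
    INSERT INTO samples VALUES
        ('A', 1), ('A', 4), ('A', 6),
        ('B', 2), ('B', 3), ('B', 5);
""")

# U for group A: number of (a, b) pairs where a beats b, ties count one half.
(u_a,) = conn.execute("""
    SELECT SUM(CASE WHEN a.val > b.val THEN 1.0
                    WHEN a.val = b.val THEN 0.5
                    ELSE 0.0 END)
    FROM samples a CROSS JOIN samples b
    WHERE a.grp = 'A' AND b.grp = 'B'
""").fetchone()

# The two U statistics always sum to n_a * n_b.
n_a, n_b = conn.execute(
    "SELECT SUM(grp = 'A'), SUM(grp = 'B') FROM samples"
).fetchone()
u_b = n_a * n_b - u_a

print(u_a, u_b)
```

Turning U into a p-value (via tables or a normal approximation) is a separate step that this sketch leaves out.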
Other example: Log-likelihood ratio
• Binomial distribution
• Multinomial distribution
• Questions/Comments?
Resampling Implementation
• Create 10,000 trials, each of size 3, as a view
• Specify the experiment (e.g., average each subsample) as a view
• Run the experiment with a single query
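The three steps above can be sketched as follows, using Python's sqlite3 as a stand-in for the database. Table and view names (`t`, `design`, `trial_avg`) and the data are hypothetical. One deliberate deviation from the slide: the sampling design is materialized as a table rather than a view, so that the random draws stay fixed across queries.

```python
import sqlite3

N_TRIALS, SAMPLE_SIZE = 10000, 3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (row_id INTEGER PRIMARY KEY, val REAL);
    INSERT INTO t (row_id, val) VALUES (1, 10), (2, 20), (3, 30), (4, 40), (5, 50);
""")
n = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# Step 1: the design. One row per (trial, draw), each naming a random row of t.
# Sampling is with replacement; the double modulo keeps row_id within 1..n.
conn.execute(f"""
    CREATE TABLE design AS
    WITH RECURSIVE draws(k) AS (
        SELECT 0
        UNION ALL
        SELECT k + 1 FROM draws WHERE k + 1 < {N_TRIALS * SAMPLE_SIZE}
    )
    SELECT k / {SAMPLE_SIZE} AS trial_id,
           (random() % {n} + {n}) % {n} + 1 AS row_id
    FROM draws
""")

# Step 2: the experiment, here the average of each subsample, as a view.
conn.execute("""
    CREATE VIEW trial_avg AS
    SELECT d.trial_id, AVG(t.val) AS avg_val
    FROM design d JOIN t ON t.row_id = d.row_id
    GROUP BY d.trial_id
""")

# Step 3: run everything with a single query over the view.
trials, mean_of_avgs, lowest, highest = conn.execute(
    "SELECT COUNT(*), AVG(avg_val), MIN(avg_val), MAX(avg_val) FROM trial_avg"
).fetchone()

print(trials, mean_of_avgs, lowest, highest)
```

Because the draws are random, the printed mean varies slightly from run to run around the true mean of `t`.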
MAD RDBMS
• Magnetic
• Get data painlessly
• Agile
• Efficient & adaptive physical storage
• Deep
• Flexible programming ecosystem
MAD Loading/Uploading
• Scatter/Gather Streaming
• Shared-nothing
• Coordination with external data:
• Data are queried while streaming
• Fast
• 4 TB/hour with minimal impact on current DB operations
• Greenplum has MapReduce support!
MAD Storage
• Tunable table types for different stages:
• External tables (e.g., files)
• Heap tables (frequent updates)
• Append-only tables (rare updates)
• Column-store flexibility
• Users can specify the distribution policy
MAD Partitioning
• Partition by range or by list of values
• E.g., partition by timestamp: old data goes to compressed tables, new data goes to heap storage.
• The query optimizer knows the partitioning scheme
• Users can delay using partitions until partitioning is complete
MAD Programming
• Flexible in coding: extensible library
• Flexible in programming metaphors: MapReduce vs. SQL
• Programmers must think through how the code works without shared memory (data-parallel)?
Directions
• Package management and reuse
• Co-optimizing storage and queries for linear algebra
• Automating physical design for iterative tasks
• Online query processing for MAD analytics
“The question is not whether to get MAD, but how and when”
– Authors
Questions
• Do Spark/Shark/BlinkDB provide a better “how”?
• It is unclear how they handle parallel processing
• Is that implied when using SQL and share-nothing architecture?
Thank you!