Statistical Computing For Big Data Deepak Agarwal, Liang Zhang LinkedIn Applied Relevance Science JSM 2013, Montreal, Canada



  • Statistical Computing for Big Data

    Deepak Agarwal, Liang Zhang

    LinkedIn Applied Relevance Science

    JSM 2013, Montreal, Canada

  • Structure of This Tutorial

    • Part I: Introduction to Map-Reduce and the Hadoop System

    – Overview of Distributed Computing

    – Introduction to Map-Reduce

    – Introduction to the Hadoop System

    – The Pig Language

    – A Deep Dive of Hadoop Map-Reduce

    • Part II: Examples of Statistical Computing for Big Data

    – Bag of Little Bootstraps

    – Large Scale Logistic Regression

    – Parallel Matrix Factorization

    • Part III: The Future of Cloud Computing


  • Big Data becoming Ubiquitous

    • Bioinformatics

    • Astronomy

    • Internet

    • Telecommunications

    • Climatology

    • …

  • Big Data: Some size estimates

    • 1000 human genomes: > 100TB of data (1000 genomes project)

    • Sloan Digital Sky Survey: 200GB data per night (>140TB aggregated)

    • Facebook: A billion monthly active users

    • LinkedIn: 225M members worldwide

    • Twitter: 500 million tweets a day

    • Over 6 billion mobile phones in the world generating data every day

  • Big Data: Paradigm shift

    • Classical Statistics

    – Generalize using small data

    • Paradigm Shift with Big Data

    – We now have an almost infinite supply of data

    – Easy statistics? Just appeal to asymptotic theory?

    • So is the issue mostly computational?

    – Not quite

    • More data comes with more heterogeneity

    • Need to change our statistical thinking to adapt

    – Classical statistics is still invaluable for thinking about big data analytics

  • Some Statistical Challenges

    • Exploratory Analysis (EDA), Visualization– Retrospective (on Terabytes)

    – More Real Time (streaming computations every few minutes/hours)

    • Statistical Modeling– Scale (computational challenge)

    – Curse of dimensionality • Millions of predictors, heterogeneity

    – Temporal and Spatial correlations

  • Statistical Challenges continued

    • Experiments

    – To test new methods, test hypotheses from randomized experiments

    – Adaptive experiments

    • Forecasting

    – Planning, advertising

    • Many more we are not fully well versed in

  • Defining Big Data

    • How to know you have the big data problem?

    – Is it only the number of terabytes?

    – What about dimensionality, structured/unstructured data, computations required, …?

    • No clear definition; let's make up one

    – When the desired computation cannot be completed in the stipulated time with the current best algorithm using the cores available on a commodity PC

    – Agree? Other suggestions?

  • Distributed Computing for Big Data

    • Distributed computing invaluable tool to scale computations for big data

    • Some distributed computing models

    – Multi-threading

    – Graphics Processing Units (GPU)

    – Message Passing Interface (MPI)

    – Map-Reduce

  • Evaluating a method for a problem

    • Scalability– Process X GB in Y hours

    • Ease of use for a statistician

    • Reliability (fault tolerance)

    • Cost– Hardware and cost of maintaining

    • Good for the computations required?– E.g., Iterative versus one pass

    • Resource sharing

  • Multithreading

    • Multiple threads take advantage of multiple CPUs

    • Shared memory

    • Threads can execute independently and concurrently

    • Can only handle Gigabytes of data

    • Reliable

  • Graphics Processing Units (GPU)

    • Number of cores:

    – CPU: order of 10

    – GPU: order of 1000 (smaller cores)

    • Can be >100x faster than CPU

    – Parallel, computationally intensive tasks are off-loaded to the GPU

    • Good for certain computationally-intensive tasks

    • Can only handle Gigabytes of data

    • Not trivial to use, requires good understanding of low-level architecture for efficient use

  • Message Passing Interface (MPI)

    • Language independent communication protocol among processes (e.g. computers)

    • Most suitable for master/slave model

    • Can handle Terabytes of data

    • Good for iterative processing

    • Fault tolerance is low

  • Map-Reduce (Dean & Ghemawat,

    2004)

    Mappers

    Reducers

    Data

    Output

    • Computation split to Map (scatter) and Reduce (gather) stages

    • Easy to Use: – User needs to implement two

    functions: Mapper and Reducer

    • Easily handles Terabytes of data

    • Very good fault tolerance (failed tasks automatically get restarted)

  • Comparison of Distributed Computing Methods

    – Scalability (data size): Multithreading – Gigabytes; GPU – Gigabytes; MPI – Terabytes; Map-Reduce – Terabytes

    – Fault Tolerance: Multithreading – High; GPU – High; MPI – Low; Map-Reduce – High

    – Maintenance Cost: Multithreading – Low; GPU – Medium; MPI – Medium; Map-Reduce – Medium-High

    – Iterative Process Complexity: Multithreading – Cheap; GPU – Cheap; MPI – Cheap; Map-Reduce – Usually expensive

    – Resource Sharing: Multithreading – Hard; GPU – Hard; MPI – Easy; Map-Reduce – Easy

    – Easy to Implement?: Multithreading – Easy; GPU – Needs understanding of low-level GPU architecture; MPI – Easy; Map-Reduce – Easy

  • Example Problem

    • Tabulating word counts in corpus of documents

    • Similar to table function in R

    • Single machine: Go through each word in each document and count, however– Documents are stored in different nodes!

    – Takes forever for big corpus

    • MPI: Each slave node takes a subset of documents; Master node summarizes the result from the slaves, however– Both master node and slave nodes may fail!

  • Word Count Through Map-Reduce

    (Diagram: Mapper 1 reads "Hello World Bye World" and Mapper 2 reads "Hello Hadoop Goodbye Hadoop"; Reducer 1 receives the words from A–G and Reducer 2 the words from H–Z. An R sketch of a word-count mapper and reducer follows below.)
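    To make the diagram concrete, here is a minimal R sketch of the word-count mapper and reducer, in the same pseudo-code spirit as the group-mean example later in this part; emit() is an assumed placeholder for the framework's output call, not a real Hadoop API.

    # Mapper: for each line of its input split, emit <word, 1>
    mapper <- function(lines) {
      for (line in lines) {
        for (word in strsplit(line, "\\s+")[[1]]) {
          emit(word, 1)                    # key = word, value = 1
        }
      }
    }

    # Reducer: receives one word (key) and the list of 1's emitted for it
    reducer <- function(key, values) {
      emit(key, sum(unlist(values)))       # <word, total count>
    }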

  • Key Ideas about Map-Reduce

    (Diagram: Big Data → Partition 1, Partition 2, …, Partition N → Mapper 1, Mapper 2, …, Mapper N → Reducer 1, Reducer 2, …, Reducer M → Output 1, Output 2, …, Output M)

  • Key Ideas about Map-Reduce

    • Data are split into partitions and stored in many different machines on disk (distributed storage)

    • Mappers process data chunks independently and emit <key, value> pairs

    • Data with the same key are sent to the same reducer. One reducer can receive multiple keys

    • Every reducer sorts its data by key

    • For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result

  • Compute Mean for Each Group

    ID Group No. Score

    1 1 0.5

    2 3 1.0

    3 1 0.8

    4 2 0.7

    5 2 1.5

    6 3 1.2

    7 1 0.8

    8 2 0.9

    9 4 1.3

    … … …

  • Key Ideas about Map-Reduce

    • Data are split into partitions and stored in many different machines on disk (distributed storage)

    • Mappers process data chunks independently and emit <key, value> pairs

    – For each row:

    • Key = Group No.

    • Value = Score

    • Data with the same key are sent to the same reducer. One reducer can receive multiple keys– E.g. 2 reducers

    – Reducer 1 receives data with key = 1, 2

    – Reducer 2 receives data with key = 3, 4

    • Every reducer sorts its data by key

    – E.g. Reducer 1: <1, {0.5, 0.8, 0.8, …}>, <2, {0.7, 1.5, 0.9, …}>

    • For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result

    – E.g. Reducer 1 output: <1, mean of the group-1 scores>, <2, mean of the group-2 scores>


    What you need to implement:

  • Mapper:

    Input: Data

    for (row in Data)

    {

    groupNo = row$groupNo;

    score = row$score;

    Output(c(groupNo, score));

    }

    Reducer:

    Input: Key (groupNo), Value (a list of scores that belong to the Key)

    count = 0;

    sum = 0;

    for (v in Value)

    {

    sum = sum + v;

    count = count + 1;

    }

    Output(c(Key, sum/count));

    Pseudo Code (in R)

  • Exercise 1

    • Problem: Average height per {Grade, Gender}?

    • What should be the mapper output key?

    • What should be the mapper output value?

    • What are the reducer input?

    • What are the reducer output?

    • Write mapper and reducer for this?

    Student ID Grade Gender Height (cm)

    1 3 M 120

    2 2 F 115

    3 2 M 116

    … … …

  • Problem: Average height per Grade and Gender? (an R sketch of the mapper and reducer follows after the table below)

    • What should be the mapper output key?– {Grade, Gender}

    • What should be the mapper output value?– Height

    • What are the reducer input?– Key: {Grade, Gender}, Value: List of Heights

    • What are the reducer output?– {Grade, Gender, mean(Heights)}

    Student ID Grade Gender Height (cm)

    1 3 M 120

    2 2 F 115

    3 2 M 116

    … … …
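    A hedged R sketch of the mapper and reducer for this exercise, in the same pseudo-code style as the group-mean example above; emit() is again an assumed placeholder for the framework's output call.

    # Mapper: for each student row, emit <{grade, gender}, height>
    mapper <- function(data) {
      for (i in seq_len(nrow(data))) {
        key <- paste(data$grade[i], data$gender[i], sep = ",")  # composite key
        emit(key, data$height[i])
      }
    }

    # Reducer: key = {grade, gender}, values = list of heights for that key
    reducer <- function(key, values) {
      emit(key, mean(unlist(values)))      # <{grade, gender}, average height>
    }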

  • Exercise 2

    • Problem: Number of students per {Grade, Gender}?

    • What should be the mapper output key?

    • What should be the mapper output value?

    • What are the reducer input?

    • What are the reducer output?

    • Write mapper and reducer for this?

    Student ID Grade Gender Height (cm)

    1 3 M 120

    2 2 F 115

    3 2 M 116

    … … …

  • Problem: Number of students per {Grade, Gender}?

    • What should be the mapper output key?– {Grade, Gender}

    • What should be the mapper output value?– 1

    • What are the reducer input?– Key: {Grade, Gender}, Value: List of 1’s

    • What are the reducer output?– {Grade, Gender, sum(value list)}

    – OR: {Grade, Gender, length(value list)}

    Student ID Grade Gender Height (cm)

    1 3 M 120

    2 2 F 115

    3 2 M 116

    … … …

  • More on Map-Reduce

    • Depends on distributed file systems

    • Typically mappers are the data storage nodes

    • Map/Reduce tasks automatically get restarted when they fail (good fault tolerance)

    • Map and Reduce I/O are all on disk

    – Data transmission from mappers to reducers is through disk copy

    • Iterative process through Map-Reduce– Each iteration becomes a map-reduce job– Can be expensive since map-reduce overhead is high

  • Introduction to the Hadoop System

  • The Apache Hadoop System

    • An open-source software for reliable, scalable, distributed computing

    • The most popular distributed computing system in the world

    • Key modules:– Hadoop Distributed File System (HDFS)

    – Hadoop YARN (job scheduling and cluster resource management)

    – Hadoop MapReduce

  • Major Tools on Hadoop

    • Pig– A high-level language for Map-Reduce computation

    • Hive– A SQL-like query language for data querying via Map-Reduce

    • Hbase– A distributed & scalable database on Hadoop

    – Allows random, real time read/write access to big data

    – Voldemort is similar to Hbase

    • Mahout– A scalable machine learning library

    • …

  • What is Covered In This Tutorial

    • Hadoop Distributed File System (HDFS)

    • Pig

    • A Deep dive of MapReduce and Hadoop

    • Running R on Hadoop

  • Hadoop Installation

    • Setting up Hadoop on your desktop/laptop:

    – http://hadoop.apache.org/docs/stable/single_node_setup.html

    • Setting up Hadoop on a cluster of machines

    – http://hadoop.apache.org/docs/stable/cluster_setup.html

  • Hadoop Distributed File System (HDFS)

    • Master/Slave architecture

    • NameNode: a single master node that controls which data block is stored where.

    • DataNodes: slave nodes that store data and do R/W operations

    • Clients (Gateway): Allow users to login and interact with HDFS and submit Map-Reduce jobs

    • Big data is split to equal-sized blocks, each block can be stored in different DataNodes

    • Disk failure tolerance: data is replicated multiple times

  • A Typical User Session on Hadoop

    • Log in to a Hadoop client machine

    • Interact with HDFS to– Upload the data from local / Locate where the data is

    • Submit a Map-Reduce job

    • Debug the job if it fails via Hadoop Job Tracker

    • Interact with HDFS to– Check the output of the job

    • Copy the output to local for further analysis if needed

    • Log out

  • HDFS Commands and Their Linux Shell Analogies

    – hdfs dfs -ls : lists the direct children of the path and their stats (Linux analogy: ls)

    – hdfs dfs -cat : outputs file content to stdout (cat)

    – hdfs dfs -chmod : changes permissions (chmod)

    – hdfs dfs -copyFromLocal : copies from a local path to HDFS (cp)

    – hdfs dfs -copyToLocal : copies from HDFS to a local path (cp)

    – hdfs dfs -mkdir : creates a directory (mkdir)

    – hdfs dfs -rm : removes a file (rm)

    – hdfs dfs -rmr : removes a path recursively (rm -r)

    – hdfs dfs -mv : moves files from source to destination (mv)

    – hdfs dfs -cp : copies files from source to destination (cp)

    – …

  • HDFS Demo Time!

  • Pig

    • A high-level platform for Map-Reduce on Hadoop

    • Pig Latin: The SQL-like intuitive language for Pig

    • Can be extended via User Defined Functions (UDF) in Java, Python, etc.

  • Compute Mean for Each Group

    ID Group No. Score

    1 1 0.5

    2 3 1.0

    3 1 0.8

    4 2 0.7

    5 2 1.5

    6 3 1.2

    7 1 0.8

    8 2 0.9

    9 4 1.3

    … … …

  • A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID : int, groupNo: int, score: float);

    B = GROUP A BY groupNo;

    C = FOREACH B GENERATE group, AVG(A.score) AS mean;

    DUMP C;

    ID Group No. Score

    1 1 0.5

    2 3 1.0

    3 1 0.8

    4 2 0.7

    5 2 1.5

    6 3 1.2

    7 1 0.8

    8 2 0.9

    9 4 1.3

    … … …

    File Sample-1.dat

  • Load the Data into Pig

    • A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID : int, groupNo: int, score: float);– The path of the data on HDFS after LOAD

    • USING PigStorage() means delimit the data by tab (can be omitted)

    • If data are delimited by other characters, e.g. space, use USING PigStorage(' ')

    • Data schema defined after AS

    • Variable types: int, long, float, double, chararray, …

  • Tuple

    • A Tuple is a data record consisting of a sequence of "fields”

    • Each field is a piece of data of any type

    • E.g. Each row of A is a Tuple:

    {ID: int, groupNo: int, score: float}

  • Data Bag

    • A Data Bag is a set of tuples

    • You may think of it as a "table”

    • Pig does not require that the tuple field types match, or even that the tuples have the same

    number of fields

  • Filter Data

    • D = FILTER A BY score>0.1 AND score

  • Per Row Operations

    • E = FOREACH A GENERATE groupNo, score, score * score AS scoreSq;

    • FOREACH…GENERATE does independent per-row operations

    • Column “ID” is thrown away

    • A new column “ScoreSq” was generated by simply doing score * score

  • GROUP…BY…

    • B = GROUP A BY groupNo;

    • A: {ID: int, groupNo: int, score: float}

    • B: {group: int, A: {(ID: int, groupNo: int, score: float)}

    • Results of GROUP…BY… always have two fields– Field “group” saves the information of the group.

    Here, it is the same as groupNo.

    – The second field is a data bag named as the variable name after GROUP (e.g. A here), which consists of all the tuples (rows) for this group (e.g. with the same groupNo)

  • AVG (Average) Operation

    • B = GROUP A BY groupNo;

    • C = FOREACH B GENERATE group, AVG(A.score) AS mean;

    • B: {group: int, A: {(ID: int, groupNo: int, score: float)}

    • C: {group: int, mean: double}

    • AVG function computes the average of the numeric values in a single-column of data bag

    • A.score retrieves the “score” column in the field “A” of data B and forms a new data bag

  • Other Sample Functions on Data Bags

    • SUM: Computes the sum of the numeric values in a single-column bag.

    • COUNT: Counts the number of tuples in a data bag

    • MIN / MAX: Computes the min/max of the numeric values in a single-column bag.

    • IsEmpty: Checks if a data bag is empty

    • …

  • Pig Output

    • DUMP function outputs on screen

    • STORE outputs to HDFS storage

    • For example:

    – DUMP C;

    – STORE C INTO 'output/Output-1' USING PigStorage();

    – Pig stores the data C into the directory with path output/Output-1

    • STORE creates the directory if it doesn't exist

    – The job fails if the directory already exists

    – Use rmf before the STORE statement to force overwrite

  • Pig Demo 1

  • Problem: Number of students per {Grade, Gender}?

    Student ID Grade Gender Height (cm)

    1 3 M 120

    2 2 F 115

    3 2 M 116

    … … …

    File Sample-2.dat

    A = LOAD 'Sample-2.dat' USING PigStorage() AS (studentID: int, grade: int, gender: chararray, height: int);

    B = GROUP A BY (grade, gender);

    C = FOREACH B GENERATE FLATTEN(group) AS (grade, gender), COUNT(A) AS numStudents;

    DUMP C;

  • GROUP…BY…

    • B = GROUP A BY (grade, gender);

    • A: {studentID : int, grade: int, gender: chararray, height: int}

    • B: {group: (grade: int, gender: chararray), A: {(studentID: int, grade: int, gender: chararray, height: int)}}

    • The “group” field in B now becomes a Tuple with two fields: grade and gender

  • FLATTEN

    • FLATTEN has to be used with FOREACH…GENERATE…

    • FLATTEN operates on Tuples or Data Bags

    • FLATTEN(a field that is a Tuple) => a set of fields– E.g. A: {a: int, b: (c: int, d: chararray)}

    – B = FOREACH A GENERATE a, FLATTEN(b);

    – B: {a: int, b::c: int, b::d: chararray}

    – B = FOREACH A GENERATE a, FLATTEN(b) AS (c:int, d:chararray);

    – B: {a: int, c:int, d:chararray}

  • FLATTEN

    • FLATTEN(a DataBag) generates N records if the bag has N tuples

    • It cross products with the other fields in the data

    • For example– A: {a: int, b: {(c: int, d: chararray)}}

    – The field b is a Data Bag that contains Tuples with two fields: c and d

    – B = FOREACH A GENERATE a, FLATTEN(b) as (c:int, d:chararray);

    A

    (1, {(2,M), (5,F)})

    (3, {(4,F), (6,M), (7,F)})

    B

    (1, 2, M)

    (1, 5, F)

    (3, 4, F)

    (3, 6, M)

    (3, 7, F)

  • Pig Demo 2

  • JOIN

    A = LOAD 'Sample-3a.dat' AS (a1:int, a2:int, a3:int);

    DUMP A;

    (1,2,3)
    (4,2,1)
    (8,3,4)
    (4,3,3)
    (7,2,5)
    (8,4,3)

    B = LOAD 'Sample-3b.dat' AS (b1:int, b2:int);

    DUMP B;

    (2,4)
    (8,9)
    (1,3)
    (2,7)
    (2,9)
    (4,6)
    (4,9)

    X = JOIN A BY a1, B BY b1;

    DUMP X;

    (1,2,3,1,3)
    (4,2,1,4,6)
    (4,3,3,4,6)
    (4,2,1,4,9)
    (4,3,3,4,9)
    (8,3,4,8,9)
    (8,4,3,8,9)

    • Matching keys are combined as a cross-product (e.g., the two A rows with a1 = 4 pair with both B rows with b1 = 4)

    • Records that do not match (e.g., a1 = 7 in A, b1 = 2 in B) are thrown away


  • COGROUP

    A and B loaded as in the JOIN example above (Sample-3a.dat, Sample-3b.dat)

    W = COGROUP A BY a1, B BY b1;

    DUMP W;

    (1,{(1,2,3)},{(1,3)})
    (2,{},{(2,4),(2,7),(2,9)})
    (4,{(4,2,1),(4,3,3)},{(4,6),(4,9)})
    (7,{(7,2,5)},{})
    (8,{(8,3,4),(8,4,3)},{(8,9)})

    DESCRIBE W;

    W: {group: int, A: {(a1: int, a2: int, a3: int)}, B: {(b1: int, b2: int)}}

  • COGROUP and JOIN

    • X = JOIN A BY a1, B BY b1;

    • X: {A::a1: int, A::a2: int, A::a3: int, B::b1: int, B::b2: int}

    • W = COGROUP A BY a1, B BY b1;

    • X = FOREACH W GENERATE FLATTEN(A), FLATTEN(B);

    • X: {A::a1: int, A::a2: int, A::a3: int, B::b1: int, B::b2: int}

    • The two formulations are equivalent; COGROUP is more flexible


  • Pig Demo 3

  • Students at schools take tests (e.g. SAT)

    • One student can take the same test multiple times

    • We take Max(scores) from one student as his/her final score

    • Problem: The top 3 schools with the largest number of students having final score > 90?

    Student ID School Score

    1 B 89

    1 B 90

    3 B 75

    2 A 60

    4 C 90

    … … …

    File Sample-4.dat

  • Problem: The top 3 schools with the largest number of students having final score > 90?

    A = LOAD 'Sample-4.dat' AS (studentID: int, school: chararray, score: int);

    B = GROUP A BY school;

    C = FOREACH B {
        D = FILTER A BY score > 90;
        E = DISTINCT D.studentID;
        GENERATE FLATTEN(group) AS school, COUNT(E) AS count;
    }

    F = ORDER C BY count DESC;

    G = LIMIT F 3;

    DUMP G;

    • Nested block for FOREACH…GENERATE; the last statement in the nested block must be GENERATE.

    • B: {group: chararray, A: {(studentID: int, school: chararray, score: int)}}

    • For each record in B, A is a field that is a Data Bag storing the data for that group (school).

    • D filters the Data Bag A by score > 90 and generates a new Data Bag.

    • E takes the studentID field from D, forms a new Data Bag of single-field tuples (D.studentID), and performs a "unique" operation via DISTINCT.

    • ORDER…BY is the sort function, implemented in map-reduce.

    • DESC means sort in descending order; without DESC, the sort is ascending.

    • Q: What does this mean? H = ORDER A BY score DESC, studentID;

    • The LIMIT operation picks the top 3 records from all the records.

  • Pig Demo 4

  • User Defined Functions in Pig (UDF)

    • Supports customized computation in mapper/reducer stages

    • Best to write in Java

    • Also supports Python

  • Calling Java UDF in Pig

    • Example: Change the name field to upper case

    REGISTER myudfs.jar;                         -- register the jar file that contains the UDF Java class

    A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);

    B = FOREACH A GENERATE myudfs.UPPER(name);   -- call the UDF to change name to upper case

    DUMP B;


  • The UPPER UDF in Java

    package myudfs;

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.util.WrappedIOException;

    // Used in FOREACH…GENERATE context; the type parameter of EvalFunc is the output type (String)
    public class UPPER extends EvalFunc<String> {
        // The input is always a Tuple of the function's parameters;
        // here there is only one parameter: the string to be upper-cased
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) return null;
            try {
                String str = (String) input.get(0);
                return str.toUpperCase();   // return the upper-cased string
            } catch (Exception e) {
                throw WrappedIOException.wrap("Caught exception processing input row ", e);
            }
        }
    }


  • Summary of Pig

    • A really nice interactive tool for Hadoop MapReduce beginners

    • Efficient for quick statistical data analysis on big data (very little coding time)

    • UDF support enables comprehensive MapReduce jobs

  • A Deep Dive into Hadoop Map-Reduce

  • Hadoop Map-Reduce in Java

    • A natural way for implementing Map-Reduce since Hadoop is written in Java

    • A Map-Reduce job class contains at least– A Mapper class

    – A Reducer class

    – A main function that sets up the configs and runs the job

    • More info: http://hadoop.apache.org/docs/stable/mapred_tutorial.html


  • Word Count Through Map-Reduce

    (Same diagram as before: Mapper 1 reads "Hello World Bye World", Mapper 2 reads "Hello Hadoop Goodbye Hadoop"; Reducer 1 handles words A–G, Reducer 2 handles words H–Z.)

    • A large amount of data has to be transferred from the mappers to the reducers when the data are large

  • How About…

    (Diagram: the same word-count setup, leading into the combiner idea on the next slide.)

  • Combiners

    • A reducer-like class that runs in the mapper stage to aggregate data into sufficient statistics before sending them to the reducers

    • Can significantly improve Hadoop Map-Reduce performance

    • Pig: Automatically applies combiners

  • (Diagram: word count with combiners. Mapper 1 reads "Hello World Bye World" and Mapper 2 reads "Hello Hadoop Goodbye Hadoop"; each mapper's output passes through its own combiner (Combiner 1, Combiner 2) before going to Reducer 1 (words A–G) and Reducer 2 (words H–Z).)

  • Partitioners

    • Determines which <key, value> pair is sent to which reducer

    • Default: random partitioning based on key’s hash value

    • Can be overwritten according to customized needs

  • Number of Reducers

    • Too small

    – The job runs forever

    • Too big

    – Waiting time to obtain many reducer nodes and starting those nodes can be long

    – A set of tiny output files causes:

    • Inefficient Hadoop namespace usage

    • Jobs that read this output as input to use an unnecessarily large number of mappers

    • Ideally

    – Each reducer runs for at least several minutes, but not too long

    – Output partition size is at least a few hundred MB

    • In Pig– B = GROUP A BY name PARALLEL 10;

    – C = COGROUP A BY name, B BY name PARALLEL 10;

    – B = ORDER A BY NAME PARALLEL 10;

    – …

  • More Efficient Hadoop Jobs

    • Why combiners?– Use combiners if necessary to reduce the amount

    of data transferred from mappers to reducers

    • Why partitioners?– Use partitioners to make sure the amount of

    computation time for each reducer is similar

    • Why optimizing number of reducers?– Optimize Map-Reduce running time

    – Optimize namespace usage

  • More Efficient Pig Scripts

    • Remove unnecessary fields early and often using FOREACH…GENERATE

    • Filter early and often

    • COGROUP and JOIN: smaller files always on the left

    – A = join small by x, large by y;

    • Use PARALLEL to tune the parallelism of the job (Q: Why?)

  • Hadoop Streaming

    • A utility to create and run Map-Reduce jobs with any executable or script as the mapper/reducer

    • For example

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \

    -input myInputDirs \

    -output myOutputDir \

    -mapper myPythonScript.py \

    -reducer /bin/wc \

    -file myPythonScript.py

    • -file ships the Python script to the cluster during job submission


  • Running R using Hadoop Streaming

    • Install R on Hadoop gateway machines

    • Package R dir into R.jar by "jar cvf R.jar -C R/ .”

    • Copy R.jar from local to Hadoop

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \

    -input myInputDirs \

    -output myOutputDir \

    -mapper "myR/bin/R --vanilla --slave --no-restore -f mapper.R" \

    -reducer "myR/bin/R --vanilla --slave --no-restore -f reducer.R" \

    -file mapper.R \

    -file reducer.R \

    -cacheArchive 'hdfs://[nameNodeURL]:9000/[HDFS location]/R.jar#myR' \

    -cmdenv R_HOME_DIR=myR \

  • Running R Using Hadoop

    Streaming

    • cacheArchive ships R.jar to each mapper/reducer node, unzips the jar into the directory myR, and executes R from myR/bin/R

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \

    -input myInputDirs \

    -output myOutputDir \

    -mapper "myR/bin/R --vanilla --slave --no-restore -f mapper.R" \

    -reducer "myR/bin/R --vanilla --slave --no-restore -f reducer.R" \

    -file mapper.R \

    -file reducer.R \

    -cacheArchive 'hdfs://[nameNodeURL]:9000/[HDFS location]/R.jar#myR' \

    -cmdenv R_HOME_DIR=myR \

  • RHive

    • An R package that interacts with Hive (SQL-like in Hadoop) from R

    • Connects R objects and functions with Hive

    • http://cran.r-project.org/web/packages/RHive/RHive.pdf

  • RHIPE

    • Another R package that aims to interact with Hadoop HDFS and Map-Reduce

    • Sample R functions– rhput: Put a file to HDFS– rhget: Copy a file to local from HDFS– rhmr: Prepares a Map-Reduce job for execution– rhex: Execute a Map-Reduce job– rhkill: Kill a Map-Reduce job– …

    • http://www.datadr.org/getpack.html

  • Summary of Part I

    • Big Data Overview: Statistical perspective

    • Map-Reduce : An increasingly popular computing system to analyze big data

    • Hadoop System: An open source implementation of Map-Reduce

    • Pig: High level Hadoop Map-Reduce language (like R for big data)

  • Structure of This Tutorial

    • Part I: Introduction to Map-Reduce and the Hadoop System

    – Overview of Distributed Computing

    – Introduction to Map-Reduce

    – Introduction to the Hadoop System

    – The Pig Language

    – A Deep Dive of Hadoop Map-Reduce

    • Part II: Examples of Statistical Computing for Big Data

    – Bag of Little Bootstraps

    – Large Scale Logistic Regression

    – Parallel Matrix Factorization

    • Part III: The Future of Cloud Computing

  • Bag of Little Bootstraps

    Kleiner et al. 2012

  • Bootstrap (Efron, 1979)

    • A re-sampling based method to obtain statistical distribution of sample estimators

    • Why are we interested?

    – Re-sampling is embarrassingly parallelizable

    • For example: standard deviation of the mean of N samples (μ)

    – For i = 1 to r do

    • Randomly sample with replacement N times from the original sample -> bootstrap data i

    • Compute the mean of the i-th bootstrap data -> μi

    – Estimate of Sd(μ) = Sd([μ1, …, μr])

    – r is usually a large number, e.g. 200 (a minimal R sketch follows below)
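    As referenced above, a minimal single-machine R sketch of the naive bootstrap for Sd(μ); the r resamples are exactly the pieces that would be farmed out to parallel nodes.

    # Naive bootstrap estimate of the standard deviation of the mean
    boot_sd_of_mean <- function(x, r = 200) {
      mu <- replicate(r, mean(sample(x, length(x), replace = TRUE)))  # r bootstrap means
      sd(mu)                                                          # Sd([mu_1, ..., mu_r])
    }

    # Example: boot_sd_of_mean(rnorm(1e4))  # should be close to 1/sqrt(1e4) = 0.01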

  • Bootstrap for Big Data

    • Can have r nodes running in parallel, each sampling one bootstrap data

    • However…

    – N can be very large

    – Data may not fit into memory

    – Collecting N samples with replacement on each node can be computationally expensive

  • M out of N Bootstrap (Bickel et al. 1997)

    • Obtain SdM(μ) by sampling M samples with replacement for each bootstrap, where M < N (often much smaller)

  • Bag of Little Bootstraps (BLB)

    • Example: standard deviation of the mean

    • Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)

    • For each data set p = 1 to S do

    – For i = 1 to r do

    • Draw N samples with replacement from the data of size b

    • Compute the mean of the resampled data -> μpi

    – Compute Sdp(μ) = Sd([μp1, …, μpr])

    • Estimate of Sd(μ) = Avg([Sd1(μ), …, SdS(μ)]) (see the R sketch below)
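    A minimal single-machine R sketch of the BLB procedure above for Sd(μ); in practice each subset p would be one mapper/reducer task, and the function and argument names here are illustrative, not from the paper.

    blb_sd_of_mean <- function(x, gamma = 0.7, S = 5, r = 100) {
      N <- length(x)
      b <- ceiling(N^gamma)                  # subset size b = N^gamma
      subset_sds <- numeric(S)
      for (p in 1:S) {
        xb <- sample(x, b, replace = FALSE)  # subsample of size b, without replacement
        mu <- numeric(r)
        for (i in 1:r) {
          # Equivalent to N draws with replacement from xb:
          # multinomial counts of dimension b, so storage and cost are O(b)
          w <- as.vector(rmultinom(1, size = N, prob = rep(1 / b, b)))
          mu[i] <- sum(w * xb) / N           # weighted mean of the resample
        }
        subset_sds[p] <- sd(mu)              # Sd_p(mu)
      }
      mean(subset_sds)                       # average over the S subsets
    }

    # Example: blb_sd_of_mean(rnorm(1e5))  # should be close to 1/sqrt(1e5)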


  • Bag of Little Bootstraps (BLB)

    • Interest: ξ(θ), where θ is an estimate obtained from size N data– ξ is some function of θ, such as standard deviation, …

    • Generate S sampled data sets, each obtained from random sampling without replacement a subset of size b (or partition the original data into S partitions, each with size b)

    • For each data p = 1 to S do– For i = 1 to r do

    • Sample N samples with replacement on the data of size b• Compute the estimate of the resampled data -> θpi

    – Compute ξp(θ) = ξ([θp1,…θpr])

    • Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])

    (On Hadoop: the per-subset resampling and estimation run as Mapper/Reducer tasks; the Gateway averages the S results.)

    ***Typo in hand out!

  • Why is BLB Efficient

    • Before:

    – N samples with replacement from size-N data is expensive when N is large

    • Now:

    – N samples with replacement from size-b data

    – b can be several orders of magnitude smaller than N (e.g. b = Nγ, γ in [0.5, 1))

    – Equivalent to: a multinomial sampler with dim = b

    – Storage = O(b), computational complexity = O(b)

  • Simulation Experiment

    • 95% CI of Logistic Regression Coefficients

    • N = 20000, 10 explanatory variables

    • Relative Error = |Estimated CI width – True CI width | / True CI width

    • BLB-γ: BLB with b = Nγ

    • BOFN-γ: b out of N sampling with b = Nγ

    • BOOT: Naïve bootstrap

  • Simulation Experiment

  • Real Data

    • 95% CI of Logistic Regression Coefficients

    • N = 6M, 3000 explanatory variables

    • Data size = 150GB, r = 50, S = 5, γ = 0.7

    (Figures: results with data stored on disk vs. data stored in memory using Spark)

  • Hyper-parameter Selection

    • From empirical experiments in the paper

    – S>=3 and r>=50 is sufficient for low relative errors

    • Adaptively selecting r and S

    – Increase r and s until estimated value converges

    – For r: Continue to process resamples and update ξp(θ) until it has ceased to change significantly.

    – For S: Continue to process more subsamples (i.e., increasing S) until BLB’s output value has stabilized.

    ***New Slide!

  • Hyper-parameter Selection

    (Figure: BLB-0.7 vs. adaptive BLB)

    • Adaptively selecting r and S

    – Increase r and S until the estimated value converges

  • Summary of BLB

    • A new algorithm for bootstrapping on big data

    • Advantages

    – Fast and efficient

    – Easy to parallelize

    – Easy to understand and implement

    – Friendly to Hadoop, makes it routine to perform

    statistical calculations on Big data

  • Large Scale Logistic Regression

    Deepak Agarwal, Bee-Chung Chen, Bo Long,

    Liang Zhang, Xianxing Zhang

    Applied Relevance Science at LinkedIn

  • Logistic Regression

    • Binary response: Y

    • Covariates: X

    • Yi ~ Bernoulli(pi)

    • log(pi / (1 − pi)) = Xiᵀβ; β ~ MVN(0, (1/λ) I)

    • Widely used (research and applications)

  • Response Prediction: Application of Logistic Regression in Recommender Systems

    • User i, with user features (e.g., industry, behavioral features, demographic features, …), visits

    • The algorithm selects item j from a set of candidates

    • (i, j): response yij (click or not)

    • Which item should we recommend to the user?

    – The item with the highest predicted response rate

    • Logistic regression is an effective technique for this

  • Examples of Recommender Systems

  • Similar Problem: Content Recommendation on the Yahoo! Front Page (Today Module)

    • Recommend content links (out of 30-40, editorially programmed)

    • 4 slots exposed (F1–F4); F1 has maximum exposure

    • Routes traffic to other Y! properties

  • Right Media Ad Exchange: Unified Marketplace

    • Match ads to page views on publisher sites

    (Diagram: a publisher with an ad impression to sell runs an auction; advertisers such as Ad.com and AdSense submit bids, e.g. $0.50, $0.60, and $0.75 via a network, which becomes a $0.45 bid; the $0.65 bid wins.)

  • Logistic Regression for Response Prediction

    • Binary response: Y

    – Click / non-click

    • Covariates: X

    – User covariates: age, gender, industry, education, job, job title, …

    – Item covariates: categories, keywords, topics, …

    – Context covariates: time, page type, position, …

    – 2-way interactions: user covariates × item covariates; context covariates × item covariates; …

  • Computational Challenge

    • Hundreds of millions/billions of observations

    • Hundreds of thousands/millions of covariates

    • Fitting such a logistic regression model on a single machine not feasible

    • Model fitting is iterative, using methods like gradient descent, Newton's method, etc.

    – Multiple passes over the data

  • Recap on Optimization method

    • Problem: Find x to min(F(x))

    • Iteration n: x_n = x_{n−1} − b_{n−1} F′(x_{n−1})

    • b_{n−1} is the step size, which can change every iteration

    • Iterate until convergence

    • Conjugate gradient, LBFGS, Newton trust region, … are all of this kind (a minimal R sketch for logistic regression follows below)
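    For concreteness, a minimal R sketch of gradient descent for an unregularized logistic regression; the fixed step size b and the iteration count are illustrative choices, not taken from the slides.

    # Gradient descent on the average negative log-likelihood F(beta)
    logit_gd <- function(X, y, b = 0.1, iters = 100) {
      beta <- rep(0, ncol(X))
      for (n in seq_len(iters)) {
        p    <- 1 / (1 + exp(-X %*% beta))   # predicted click probabilities
        grad <- t(X) %*% (p - y) / nrow(X)   # F'(beta_{n-1})
        beta <- beta - b * grad              # x_n = x_{n-1} - b_{n-1} F'(x_{n-1})
      }
      drop(beta)
    }

    # Each iteration needs one full pass over the data, which is why a Hadoop
    # implementation costs one Map-Reduce job per iteration.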

  • Iterative Process with Hadoop

    (Diagram: in every iteration, data are read from disk into mappers, mapper output is written to disk and read by reducers, and reducer output is written back to disk.)

  • Limitations of Hadoop for fitting a big logistic regression

    • Iterative process is expensive and slow

    • Every iteration = a Map-Reduce job

    • I/O of mapper and reducers are both through disk

    • Plus: Waiting in queue time

    • Q: Can we find a fitting method that scales with Hadoop ?

  • Large Scale Logistic Regression

    • Naïve: – Partition the data and run logistic regression for each partition

    – Take the mean of the learned coefficients

    – Problem: Not guaranteed to converge to the model from single machine!

    • Alternating Direction Method of Multipliers (ADMM)– Boyd et al. 2011

    – Set up constraints: each partition’s coefficient = global consensus

    – Solve the optimization problem using Lagrange Multipliers

    – Advantage: guaranteed to converge to a single machine logistic regression on the entire data with reasonable number of iterations

  • Large Scale Logistic Regression via ADMM

    (Diagram, iterations 1, 2, …: BIG DATA is split into Partition 1 … Partition K; a logistic regression is fit on each partition; a consensus computation combines the per-partition coefficients; the consensus is sent back to the partitions and the process repeats.)

  • Details of ADMM

  • Dual Ascent Method

    • Consider a convex optimization problem

    • Lagrangian for the problem:

    • Dual Ascent (a standard statement is sketched below):
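    A standard statement of dual ascent, following Boyd et al. (2011), assuming the convex problem min_x f(x) subject to Ax = b; the notation is assumed, not taken from the slide:

    \begin{aligned}
    L(x, y) &= f(x) + y^{\top}(Ax - b) && \text{(Lagrangian)}\\
    x^{k+1} &= \operatorname*{arg\,min}_x \; L(x, y^{k}) && \text{(primal minimization)}\\
    y^{k+1} &= y^{k} + \alpha^{k}\,(A x^{k+1} - b) && \text{(dual ascent step, step size } \alpha^{k}\text{)}
    \end{aligned}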

  • Augmented Lagrangians

    • Bring robustness to the dual ascent method

    • Yield convergence without assumptions like strict convexity or finiteness of f

    • The value of ρ influences the convergence rate

  • Alternating Direction Method of

    Multipliers (ADMM)

    • Problem:

    • Augmented Lagrangians

    • ADMM (sketched below):
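    A hedged restatement of standard ADMM (Boyd et al. 2011) for min f(x) + g(z) subject to Ax + Bz = c, with augmented Lagrangian parameter ρ > 0:

    \begin{aligned}
    L_{\rho}(x, z, y) &= f(x) + g(z) + y^{\top}(Ax + Bz - c) + \tfrac{\rho}{2}\,\|Ax + Bz - c\|_2^2\\
    x^{k+1} &= \operatorname*{arg\,min}_x \; L_{\rho}(x, z^{k}, y^{k})\\
    z^{k+1} &= \operatorname*{arg\,min}_z \; L_{\rho}(x^{k+1}, z, y^{k})\\
    y^{k+1} &= y^{k} + \rho\,(A x^{k+1} + B z^{k+1} - c)
    \end{aligned}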

  • Large Scale Logistic Regression via ADMM

    • Notation

    – (Xi , yi): data in the ith partition

    – βi: coefficient vector for partition i

    – β: Consensus coefficient vector

    – r(β): penalty component such as ||β||22

    • Optimization problem: minimize the sum of the per-partition logistic losses li(βi) plus the penalty r(β), subject to βi = β for every partition i

  • ADMM updates

    (The update equations, shown as a figure: local regressions with shrinkage towards the current best global estimate, followed by an updated consensus; a sketch follows below.)
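    A sketch of the consensus-form updates for this problem, following Boyd et al. (2011); the scaled dual variables ui are assumed notation, and li denotes the logistic loss on partition i:

    \begin{aligned}
    \beta_i^{k+1} &= \operatorname*{arg\,min}_{\beta_i} \Big( l_i(\beta_i) + \tfrac{\rho}{2}\,\|\beta_i - \bar{\beta}^{k} + u_i^{k}\|_2^2 \Big)
      && \text{(local regressions, shrunk towards the consensus)}\\
    \bar{\beta}^{k+1} &= \operatorname*{arg\,min}_{\beta} \Big( r(\beta) + \tfrac{K\rho}{2}\,\big\|\beta - \tfrac{1}{K}\textstyle\sum_{i}(\beta_i^{k+1} + u_i^{k})\big\|_2^2 \Big)
      && \text{(updated consensus)}\\
    u_i^{k+1} &= u_i^{k} + \beta_i^{k+1} - \bar{\beta}^{k+1}
      && \text{(dual update)}
    \end{aligned}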

  • An example implementation

    • ADMM for Logistic regression model fitting with L2/L1 penalty

    • Each iteration of ADMM is a Map-Reduce job– Mapper: partition the data into K partitions

    – Reducer: For each partition, use liblinear/glmnet to fit a L1/L2 logistic regression

    – Gateway: consensus computation by results from all reducers, and sends back the consensus to each reducer node

  • KDD CUP 2010 Data

    • Bridge to Algebra 2008-2009 data from https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp

    • Binary response, 20M covariates

    • Only keep covariates with >= 10 occurrences => 2.2M covariates

    • Training data: 8,407,752 samples

    • Test data : 510,302 samples

  • Avg Training Log-likelihood vs Number of Iterations

  • Test AUC vs Number of Iterations

  • Better Convergence Can Be Achieved By

    • Better initialization

    – Use results from the naïve method to initialize the parameters

    • Adaptively changing the step size (ρ) for each iteration based on the convergence status of the consensus

  • Still…

    • ADMM in hadoop can take hours to converge

    • Is there a better way to handle iterative learning process in Hadoop?

  • Parallel Matrix Factorization

    Deepak Agarwal, Bee-Chung Chen,

    Rajiv Khanna, Liang Zhang

    Applied Relevance Science at LinkedIn

  • Personalized Webpage Is Everywhere

  • Personalized Webpage Is Everywhere

  • Common Properties of Web Personalization Problems

    • One or multiple metrics to optimize

    – Click Through Rate (CTR) (focus of this talk)

    – Revenue per impression

    – Time spent on the landing page

    – Ad conversion rate

    – …

    • Large scale data– MapReduce to solve the problem!

    • Sparsity

    • Cold-start– User features: Age, gender, position, industry, …

    – Item features: Category, key words, creator features, …

  • Problem Setup

    • CTR prediction for a user on an item

    • Assumptions: – There are sufficient data per item to estimate per-item model

    – Serving bias and positional bias are removed by randomly serving scheme

    – Item popularities are quite dynamic and have to be estimated in real-time fashion

    • Examples:– Yahoo! Front page Today module

    – Linkedin Today module

  • Online Logistic Regression (OLR)

    • User i with feature vector xi, article j

    • Binary response y (click/non-click)

    • Prior on the per-article coefficient vector βj

    • Use a Laplace approximation or variational Bayesian methods to obtain the posterior

    • The posterior becomes the new prior for the next batch of data

    • The prior and posterior covariances can be approximated as diagonal for high-dimensional xi

    (A sketch of the update cycle follows below.)
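    A hedged sketch of the OLR update cycle described above, in the spirit of Agarwal, Chen, and Elango (2010); the notation (μj, Σj) is an assumption, not copied from the slides:

    \begin{aligned}
    y &\sim \mathrm{Bernoulli}(p_{ij}), \qquad \log\tfrac{p_{ij}}{1 - p_{ij}} = x_i^{\top}\beta_j\\
    \text{prior: } & \beta_j \sim N(\mu_j, \Sigma_j)\\
    \text{posterior: } & p(\beta_j \mid \text{new batch of data}) \approx N(\tilde{\mu}_j, \tilde{\Sigma}_j) \quad \text{(Laplace or variational approximation)}\\
    \text{new prior: } & (\mu_j, \Sigma_j) \leftarrow (\tilde{\mu}_j, \tilde{\Sigma}_j), \quad \text{with } \Sigma_j \text{ taken diagonal for high-dimensional } x_i
    \end{aligned}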

  • User Covariates for OLR

    • Age, gender, industry, job position for login users

    • General behavior targeting (BT) covariates– Music? Finance? Politics?

    • User profiles from historical view/click behavior on previous items in the data, e.g.– Item-profile: use previously clicked item ids as the user profile

    – Category-profile: use item category affinity score as profile. The score can be simply user’s historical CTR on each category.

    – Are there better ways to generate user profiles?

    – Yes! By matrix factorization!

  • Generalized Matrix Factorization (GMF) Framework

    (Diagram, following Bell et al. 2007: the predicted score combines global features, a user effect, an item effect, and the inner product of user factors and item factors.)

  • Regression Priors

    • g(·), h(·), G(·), H(·) can be any regression functions

    • Agarwal and Chen (KDD 2009); Zhang et al. (RecSys 2011)

    (g and G are regressions on user covariates; h and H are regressions on item covariates; a sketch of the full specification follows below.)
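    A hedged sketch of the GMF score and its regression priors, consistent with Agarwal and Chen (2009) and Zhang et al. (2011); the exact notation is assumed:

    \begin{aligned}
    \text{score}_{ij} &= f(x_{ij}) + \alpha_i + \beta_j + u_i^{\top} v_j
      && \text{(global features + user effect + item effect + factors)}\\
    \alpha_i &\sim N\big(g(x_i), \sigma_\alpha^2\big), \qquad \beta_j \sim N\big(h(x_j), \sigma_\beta^2\big)\\
    u_i &\sim N\big(G(x_i), \sigma_u^2 I\big), \qquad\; v_j \sim N\big(H(x_j), \sigma_v^2 I\big)
    \end{aligned}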

  • Different Types of Prior Regression Models

    • Zero prior mean

    – Bilinear random effects (BIRE)

    • Linear regression

    – Simple regression (RLFM)

    – Lasso penalty (LASSO)

    • Tree models

    – Recursive partitioning (RP)

    – Random forests (RF)

    – Gradient boosting machines (GB)

    – Bayesian additive regression trees (BART)

  • Several Model Fitting Approaches

    • Gibbs Sampling

    • Stochastic Gradient Descent

    • Monte Carlo Expectation-Maximization (MCEM)

    • For now: Single machine only (will discuss Parallel

    Matrix Factorization later)

  • Gibbs Sampling

    • Put additional priors on f(·), g(·), h(·), G(·), H(·), σα, σβ, σu, σv

    • For each iteration, sample full conditional posteriors of α, β, u, v, f, g, h, G, H, σα, σβ, σu, σv

    • Need plenty of iterations to converge and obtain reasonable posterior mean estimates of these parameters

    • When data is large, not feasible for single machine

    • Iterative property of Gibbs Sampling makes parallelization on Hadoop not feasible

  • Stochastic Gradient Descent (SGD)

    • A popular model fitting approach for matrix factorization since the Netflix competition

    • U and V are unknown coefficient matrices that map the cold-start covariates xi and xj to the low-dimensional latent space

    • Loss function L = Σk Lk, a sum of per-observation losses

  • Stochastic Gradient Descent (SGD)

    • Assume data have N samples

    • Loss function L = Σk Lk

    • For k = 1, 2, …, N do: take a gradient step on Lk (a sketch follows below)

    • Can run multiple passes of data to achieve convergence

    • ρk is the step size for each observation k
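    A sketch of the per-observation gradient step; since the exact loss Lk is not restated here, only the generic form is shown, with ρk the step size and observation k involving user i_k and item j_k:

    \begin{aligned}
    u_{i_k} &\leftarrow u_{i_k} - \rho_k\, \partial L_k / \partial u_{i_k}, \qquad
    v_{j_k} \leftarrow v_{j_k} - \rho_k\, \partial L_k / \partial v_{j_k}
    \end{aligned}

    with analogous steps for the cold-start maps U and V.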

  • Our Approach: MCEM

    • Monte Carlo EM (Booth and Hobert 1999)

    • E Step: obtain N samples of the conditional posterior of the random effects given the current parameter estimates

    • M Step: update the parameter estimates (the regression functions and variance components) using the posterior samples

  • Handling Binary Responses

    • Gaussian responses: the conditional posteriors have closed form

    • Binary responses + Logistic: no longer closed form

    • Variational approximation (VAR)

    • Adaptive rejection sampling (ARS)

  • Variational Approximation (VAR)

    • Initially let ξij = 1

    • Before each E-step, create pseudo Gaussian response for each binary observation

    • Run E-Step and M-Step using the Gaussian pseudo response

    • After M-step, let

  • Adaptive Rejection Sampling (ARS)

    • For each E-step, obtain precise conditional posterior samples of the random effects (α, β, u, v)

  • Simulation Study

    • 10 simulated data sets, 100K samples for both training and test

    • 1000 users and 1000 items in training

    • Extra 500 new users and 500 new items in test + old users/items

    • For each user/item, 200 covariates, only 10 useful

    • Construct non-linear regression model from 20 Gaussian functions for simulating α, β, u and v following Friedman (2001)

  • MovieLens 1M Data Set

    • 1M ratings with scale 1-5

    • 6040 users

    • 3706 movies

    • Sort by time, first 75% training, last 25% test

    • A lot of new users in the test data set

    • User covariates: Age, gender, occupation, zip code

    • Item covariates: Movie genre

  • Performance Comparison

  • However…

    • We are working with very large scale data sets!

    • Parallel matrix factorization methods using Map-Reduce have to be developed!

    • Khanna et al. 2012 Technical report

  • Model Fitting Using MCEM

    • Monte Carlo EM (Booth and Hobert 1999)

    • E Step: obtain N samples of the conditional posterior of the random effects given the current parameter estimates

    • M Step: update the parameter estimates using the posterior samples

  • Parallel Matrix Factorization

    • Partition the data into m partitions

    • For each partition, run the MCEM algorithm and get the parameter estimates

    • Ensemble runs: for k = 1, …, n

    – Repartition the data into m partitions with a new seed

    – Run an E-step-only job for each partition, given the estimated parameters

    • Average the user/item factors over all partitions and all k to obtain the final estimate

  • Parallel Matrix Factorization on Hadoop

    • The initial per-partition MCEM fitting is one Map-Reduce job

    • Each ensemble run is a separate Map-Reduce job

  • Key Points

    • Partitioning is tricky!– By events? By items? By users?

    • Empirically, "divide and conquer" plus averaging the per-partition estimates works well!

    • Ensemble runs: after the parameters are obtained, we run n E-step-only jobs and take the average, each job using a different user-item mix.

  • Identifiability Issues (MCEM-ARSID)

    • The same log-likelihood can be achieved under several transformations; each needs a constraint:

    – g(·) → g(·) + r, h(·) → h(·) − r: center α, β, u to zero mean in every E-step

    – u → −u, v → −v: constrain v to be positive

    – Switching (u.1, v.1) with (u.2, v.2): set ui ~ N(G(xi), I), vj ~ N(H(xj), λI) with the diagonal entries constrained so that λ1 >= λ2 >= …

  • MovieLens 1M Data

    • 75% training and 25% test split by time

    • Imbalanced data– User rating = 1: Positive

    – User rating = 2, 3, 4, 5: Negative

    – 5% positive rate

    • Balanced data– User rating = 1, 2, 3: Positive

    – User rating = 4, 5: Negative

    – 44% positive rate

  • Big difference between VAR and ARS for imbalanced data!

  • Matrix Factorization For User Profile

    • Offline user-profile-building period: obtain the user factor ui for user i

    • Online modeling using OLR

    – If a user has a profile (warm start), use ui as the user covariates

    – If not (cold start), use the prior regression prediction G(xi) as the user covariates

  • Offline Evaluation Metric Related to Clicks

    • For model M and J live items (articles) at any time, compute a score S(M) from the randomly served data

    • If M is a random (constant) model, E[S(M)] = #clicks

    • Unbiased estimate of the expected total clicks (Langford et al. 2008)

  • Experiments on Big Data

    • Yahoo! Front Page Today Module data

    • Data for building user profile: 8M users with at least 10 clicks (heavy users) in June 2011, 1B events

    • Data for training and testing OLR model: Random served data with 2.4M clicks in July 2011

    • Heavy users contributed around 30% of clicks

    • User covariates / features for OLR:– Intercept-only (MOST POPULAR)

    – 124 Behavior targeting features (BT-ONLY)

    – BT + top 1000 clicked article ids (ITEM-PROFILE)

    – BT + user profile with CTR on 43 binary content categories (CATEGORY-PROFILE)

    – BT + profiles from matrix factorization models

  • Click Lift Performance for Different User Profiles

    Warm Start: Users with at least one sample in training data

    Cold Start: Users with no data in training data

  • Structure of This Tutorial

    • Part I: Introduction to Map-Reduce and the Hadoop System

    – Overview of Distributed Computing

    – Introduction to Map-Reduce

    – Introduction to the Hadoop System

    – The Pig Language

    – A Deep Dive of Hadoop Map-Reduce

    • Part II: Examples of Statistical Computing for Big Data

    – Bag of Little Bootstraps

    – Large Scale Logistic Regression

    – Parallel Matrix Factorization

    • Part III: The Future of Cloud Computing

  • Spark

    • An open source cluster computing system that works with Hadoop HDFS developed in UC Berkeley AMPLab

    • In-memory cluster computing

    • Better than Hadoop for iterative algorithms and interactive data mining

    • Can be 100x faster than Hadoop Map-Reduce for some tasks

    • Code in Scala – easy to write

    • http://spark-project.org/

  • Logistic Regression in Spark vs Hadoop

  • Gradient Descent for Logistic Regression

  • Iterative Process in Hadoop

    (Diagram: in every iteration, data are read from disk into mappers, mapper output is written to disk and read by reducers, and reducer output is written back to disk.)

  • Iterative Process in Spark (Gradient Descent)

    (Diagram: data are read from disk into memory once; in each iteration mappers compute gradients and reducers aggregate them, entirely in memory.)

  • GraphLab

    • An open-source graph-based, high performance, distributed computation framework in C++

    • http://graphlab.org/

    • HDFS integration

    • Major design– Sparse data with local dependencies

    – Iterative algorithms

    – Potentially asynchronous execution among nodes

  • GraphLab

    • Graph-parallel

    • Map-Reduce: computation applied to independent records

    • GraphLab: dependent records stored as vertices in a large distributed data-graph

    • Computation in parallel on each vertex and can interact with neighboring vertices

  • Example: PageRank for Web Pages

    • Interest: probability of landing on a page by random clicking

    ***New Slide!

  • Example: PageRank

    • R[i] = Stationary probability of Node i

    • 1 − α = probability that a person stops clicking at any page (a sketch of the standard update follows below)

    ***New Slide!
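    A hedged statement of the standard PageRank recursion, using the document's α (so 1 − α is the stopping probability) and N pages; normalization conventions vary:

    R[i] \;=\; \frac{1 - \alpha}{N} \;+\; \alpha \sum_{j \to i} \frac{R[j]}{\mathrm{outdegree}(j)}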

  • Example: PageRank

  • Good For…

    • Bilinear random effect models (matrix factorization in collaborative filtering)

    • Clustering

    • Graphical models

    • Topic modeling

    • Graph analytics

    • …

  • GraphX

    • Combines the advantages of both data-parallel and graph-parallel systems

    • Distributed graph computation on Spark

  • Bibliography

    Agarwal, D. and Chen, B. (2009). Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 19–28. ACM.

    Agarwal, D., Chen, B., and Elango, P. (2010). Fast online learning through offline initialization for time-sensitive recommendation. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 703–712. ACM.

    Bell, R., Koren, Y., and Volinsky, C. (2007). Modeling relationships at multiple scales to improve accuracy of large recommender systems. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 95–104. ACM.

    Bickel, P. J., Götze, F., and van Zwet, W. R. (2012). Resampling fewer than n observations: gains, losses, and remedies for losses. Springer New York, 267–297.

    Booth, J. G., and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(1), 265–285.

    Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.

    Dean, J., and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.

    Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1–26.

    Khanna, R., Zhang, L., Agarwal, D., and Chen, B. (2012). Parallel matrix factorization for binary response. arXiv.org.

    Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. (2012). The big data bootstrap. arXiv preprint arXiv:1206.6415.

    Zhang, L., Agarwal, D., and Chen, B. (2011). Generalizing matrix factorization through flexible regression priors. In Proceedings of the fifth ACM conference on Recommender systems, 13–20. ACM.