Statistical Computing For Big Data Deepak Agarwal, Liang Zhang LinkedIn Applied Relevance Science JSM 2013, Montreal, Canada



  • Statistical Computing for Big Data

    Deepak Agarwal, Liang Zhang

    LinkedIn Applied Relevance Science

    JSM 2013, Montreal, Canada

  • Structure of This Tutorial

    • Part I: Introduction to Map-Reduce and the Hadoop System

    – Overview of Distributed Computing

    – Introduction to Map-Reduce

    – Introduction to the Hadoop System

    – The Pig Language

    – A Deep Dive of Hadoop Map-Reduce

    • Part II: Examples of Statistical Computing for Big Data

    – Bag of Little Bootstraps

    – Large Scale Logistic Regression

    – Parallel Matrix Factorization

    • Part III: The Future of Cloud Computing


  • Big Data becoming Ubiquitous

    • Bioinformatics

    • Astronomy

    • Internet

    • Telecommunications

    • Climatology

    • …

  • Big Data: Some size estimates

    • 1000 human genomes: > 100TB of data (1000 genomes project)

    • Sloan Digital Sky Survey: 200GB data per night (>140TB aggregated)

    • Facebook: A billion monthly active users

    • LinkedIn: 225M members worldwide

    • Twitter: 500 million tweets a day

    • Over 6 billion mobile phones in the world generating data every day

  • Big Data: Paradigm shift

    • Classical Statistics

    – Generalize using small data

    • Paradigm Shift with Big Data

    – We now have an almost infinite supply of data

    – Easy statistics? Just appeal to asymptotic theory?

    • So is the issue mostly computational?

    – Not quite

    • More data comes with more heterogeneity

    • Need to change our statistical thinking to adapt

    – Classical statistics is still invaluable for thinking about big data analytics

  • Some Statistical Challenges

    • Exploratory Analysis (EDA), Visualization– Retrospective (on Terabytes)

    – More Real Time (streaming computations every few minutes/hours)

    • Statistical Modeling– Scale (computational challenge)

    – Curse of dimensionality • Millions of predictors, heterogeneity

    – Temporal and Spatial correlations

  • Statistical Challenges continued

    • Experiments

    – To test new methods, test hypotheses from randomized experiments

    – Adaptive experiments

    • Forecasting

    – Planning, advertising

    • Many more we are not fully well versed in

  • Defining Big Data

    • How to know you have the big data problem?

    – Is it only the number of terabytes?

    – What about dimensionality, structured/unstructured data, computations required, …?

    • No clear definition; let's make up one

    – When the desired computation cannot be completed in the stipulated time with the current best algorithm using the cores available on a commodity PC

    – Agree? Other suggestions?

  • Distributed Computing for Big Data

    • Distributed computing invaluable tool to scale computations for big data

    • Some distributed computing models

    – Multi-threading

    – Graphics Processing Units (GPU)

    – Message Passing Interface (MPI)

    – Map-Reduce

  • Evaluating a method for a problem

    • Scalability– Process X GB in Y hours

    • Ease of use for a statistician

    • Reliability (fault tolerance)

    • Cost– Hardware and cost of maintaining

    • Good for the computations required?– E.g., Iterative versus one pass

    • Resource sharing

  • Multithreading

    • Multiple threads take advantage of multiple CPUs

    • Shared memory

    • Threads can execute independently and concurrently

    • Can only handle Gigabytes of data

    • Reliable

  • Graphics Processing Units (GPU)

    • Number of cores:

    – CPU: order of 10

    – GPU: order of 1000 (smaller cores)

    • Can be >100x faster than CPU

    – Parallel, computationally intensive tasks are off-loaded to the GPU

    • Good for certain computationally-intensive tasks

    • Can only handle Gigabytes of data

    • Not trivial to use, requires good understanding of low-level architecture for efficient use

  • Message Passing Interface (MPI)

    • Language independent communication protocol among processes (e.g. computers)

    • Most suitable for master/slave model

    • Can handle Terabytes of data

    • Good for iterative processing

    • Fault tolerance is low

  • Map-Reduce (Dean & Ghemawat,

    2004)

    Mappers

    Reducers

    Data

    Output

    • Computation split to Map (scatter) and Reduce (gather) stages

    • Easy to Use: – User needs to implement two

    functions: Mapper and Reducer

    • Easily handles Terabytes of data

    • Very good fault tolerance (failed tasks automatically get restarted)

  • Comparison of Distributed Computing Methods

    – Scalability (data size): Multithreading – Gigabytes; GPU – Gigabytes; MPI – Terabytes; Map-Reduce – Terabytes

    – Fault Tolerance: Multithreading – High; GPU – High; MPI – Low; Map-Reduce – High

    – Maintenance Cost: Multithreading – Low; GPU – Medium; MPI – Medium; Map-Reduce – Medium-High

    – Iterative Process Complexity: Multithreading – Cheap; GPU – Cheap; MPI – Cheap; Map-Reduce – Usually expensive

    – Resource Sharing: Multithreading – Hard; GPU – Hard; MPI – Easy; Map-Reduce – Easy

    – Easy to Implement?: Multithreading – Easy; GPU – Needs understanding of low-level GPU architecture; MPI – Easy; Map-Reduce – Easy

  • Example Problem

    • Tabulating word counts in corpus of documents

    • Similar to table function in R

    • Single machine: Go through each word in each document and count, however– Documents are stored in different nodes!

    – Takes forever for big corpus

    • MPI: Each slave node takes a subset of documents; Master node summarizes the result from the slaves, however– Both master node and slave nodes may fail!

  • Word Count Through Map-Reduce

    (Diagram: Mapper 1 reads "Hello World Bye World" and Mapper 2 reads "Hello Hadoop Goodbye Hadoop"; Reducer 1 receives the words from A–G and Reducer 2 the words from H–Z. An R sketch of a word-count mapper and reducer follows below.)
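    To make the diagram concrete, here is a minimal R sketch of the word-count mapper and reducer, in the same pseudo-code spirit as the group-mean example later in this part; emit() is an assumed placeholder for the framework's output call, not a real Hadoop API.

    # Mapper: for each line of its input split, emit <word, 1>
    mapper <- function(lines) {
      for (line in lines) {
        for (word in strsplit(line, "\\s+")[[1]]) {
          emit(word, 1)                    # key = word, value = 1
        }
      }
    }

    # Reducer: receives one word (key) and the list of 1's emitted for it
    reducer <- function(key, values) {
      emit(key, sum(unlist(values)))       # <word, total count>
    }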

  • Key Ideas about Map-Reduce

    (Diagram: Big Data → Partition 1, Partition 2, …, Partition N → Mapper 1, Mapper 2, …, Mapper N → Reducer 1, Reducer 2, …, Reducer M → Output 1, Output 2, …, Output M)

  • Key Ideas about Map-Reduce

    • Data are split into partitions and stored in many different machines on disk (distributed storage)

    • Mappers process data chunks independently and emit <key, value> pairs

    • Data with the same key are sent to the same reducer. One reducer can receive multiple keys

    • Every reducer sorts its data by key

    • For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result

  • Compute Mean for Each Group

    ID Group No. Score

    1 1 0.5

    2 3 1.0

    3 1 0.8

    4 2 0.7

    5 2 1.5

    6 3 1.2

    7 1 0.8

    8 2 0.9

    9 4 1.3

    … … …

  • Key Ideas about Map-Reduce

    • Data are split into partitions and stored in many different machines on disk (distributed storage)

    • Mappers process data chunks independently and emit <key, value> pairs

    – For each row:

    • Key = Group No.

    • Value = Score

    • Data with the same key are sent to the same reducer. One reducer can receive multiple keys– E.g. 2 reducers

    – Reducer 1 receives data with key = 1, 2

    – Reducer 2 receives data with key = 3, 4

    • Every reducer sorts its data by key

    – E.g. Reducer 1: <1, {0.5, 0.8, 0.8, …}>, <2, {0.7, 1.5, 0.9, …}>

    • For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result

    – E.g. Reducer 1 output: <1, mean of the group-1 scores>, <2, mean of the group-2 scores>


    What you need to implement:

  • Mapper:

    Input: Data

    for (row in Data)

    {

    groupNo = row$groupNo;

    score = row$score;

    Output(c(groupNo, score));

    }

    Reducer:

    Input: Key (groupNo), Value (a list of scores that belong to the Key)

    count = 0;

    sum = 0;

    for (v in Value)

    {

    sum = sum + v;

    count = count + 1;

    }

    Output(c(Key, sum/count));

    Pseudo Code (in R)

  • Exercise 1

    • Problem: Average height per {Grade, Gender}?

    • What should be the mapper output key?

    • What should be the mapper output value?

    • What are the reducer input?

    • What are the reducer output?

    • Write mapper and reducer for this?

    Student ID Grade Gender Height (cm)

    1 3 M 120

    2 2 F 115

    3 2 M 116

    … … …

  • Problem: Average height per Grade and Gender? (an R sketch of the mapper and reducer follows after the table below)

    • What should be the mapper output key?– {Grade, Gender}

    • What should be the mapper output value?– Height

    • What are the reducer input?– Key: {Grade, Gender}, Value: List of Heights

    • What are the reducer output?– {Grade, Gender, mean(Heights)}

    Student ID Grade Gender Height (cm)

    1 3 M 120

    2 2 F 115

    3 2 M 116

    … … …
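    A hedged R sketch of the mapper and reducer for this exercise, in the same pseudo-code style as the group-mean example above; emit() is again an assumed placeholder for the framework's output call.

    # Mapper: for each student row, emit <{grade, gender}, height>
    mapper <- function(data) {
      for (i in seq_len(nrow(data))) {
        key <- paste(data$grade[i], data$gender[i], sep = ",")  # composite key
        emit(key, data$height[i])
      }
    }

    # Reducer: key = {grade, gender}, values = list of heights for that key
    reducer <- function(key, values) {
      emit(key, mean(unlist(values)))      # <{grade, gender}, average height>
    }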

  • Exercise 2

    • Problem: Number of students per {Grade, Gender}?

    • What should be the mapper output key?

    • What should be the mapper output value?

    • What are the reducer input?

    • What are the reducer output?

    • Write mapper and reducer for this?

    Student ID Grade Gender Height (cm)

    1 3 M 120

    2 2 F 115

    3 2 M 116

    … … …

  • Problem: Number of students per {Grade, Gender}?

    • What should be the mapper output key?– {Grade, Gender}

    • What should be the mapper output value?– 1

    • What are the reducer input?– Key: {Grade, Gender}, Value: List of 1’s

    • What are the reducer output?– {Grade, Gender, sum(value list)}

    – OR: {Grade, Gender, length(value list)}

    Student ID Grade Gender Height (cm)

    1 3 M 120

    2 2 F 115

    3 2 M 116

    … … …

  • More on Map-Reduce

    • Depends on distributed file systems

    • Typically mappers are the data storage nodes

    • Map/Reduce tasks automatically get restarted when they fail (good fault tolerance)

    • Map and Reduce I/O are all on disk

    – Data transmission from mappers to reducers is through disk copy

    • Iterative process through Map-Reduce– Each iteration becomes a map-reduce job– Can be expensive since map-reduce overhead is high

  • Introduction to the Hadoop System

  • The Apache Hadoop System

    • An open-source software for reliable, scalable, distributed computing

    • The most popular distributed computing system in the world

    • Key modules:– Hadoop Distributed File System (HDFS)

    – Hadoop YARN (job scheduling and cluster resource management)

    – Hadoop MapReduce

  • Major Tools on Hadoop

    • Pig– A high-level language for Map-Reduce computation

    • Hive– A SQL-like query language for data querying via Map-Reduce

    • Hbase– A distributed & scalable database on Hadoop

    – Allows random, real time read/write access to big data

    – Voldemort is similar to Hbase

    • Mahout– A scalable machine learning library

    • …

  • What is Covered In This Tutorial

    • Hadoop Distributed File System (HDFS)

    • Pig

    • A Deep dive of MapReduce and Hadoop

    • Running R on Hadoop

  • Hadoop Installation

    • Setting up Hadoop on your desktop/laptop:

    – http://hadoop.apache.org/docs/stable/single_node_setup.html

    • Setting up Hadoop on a cluster of machines

    – http://hadoop.apache.org/docs/stable/cluster_setup.html

  • Hadoop Distributed File System (HDFS)

    • Master/Slave architecture

    • NameNode: a single master node that controls which data block is stored where.

    • DataNodes: slave nodes that store data and do R/W operations

    • Clients (Gateway): Allow users to login and interact with HDFS and submit Map-Reduce jobs

    • Big data is split to equal-sized blocks, each block can be stored in different DataNodes

    • Disk failure tolerance: data is replicated multiple times

  • A Typical User Session on Hadoop

    • Log in to a Hadoop client machine

    • Interact with HDFS to– Upload the data from local / Locate where the data is

    • Submit a Map-Reduce job

    • Debug the job if it fails via Hadoop Job Tracker

    • Interact with HDFS to– Check the output of the job

    • Copy the output to local for further analysis if needed

    • Log out

  • HDFS Commands and Their Linux Shell Analogies

    – hdfs dfs -ls : lists the direct children of the path and their stats (Linux analogy: ls)

    – hdfs dfs -cat : outputs file content to stdout (cat)

    – hdfs dfs -chmod : changes permissions (chmod)

    – hdfs dfs -copyFromLocal : copies from a local path to HDFS (cp)

    – hdfs dfs -copyToLocal : copies from HDFS to a local path (cp)

    – hdfs dfs -mkdir : creates a directory (mkdir)

    – hdfs dfs -rm : removes a file (rm)

    – hdfs dfs -rmr : removes a path recursively (rm -r)

    – hdfs dfs -mv : moves files from source to destination (mv)

    – hdfs dfs -cp : copies files from source to destination (cp)

    – …

  • HDFS Demo Time!

  • Pig

    • A high-level platform for Map-Reduce on Hadoop

    • Pig Latin: The SQL-like intuitive language for Pig

    • Can be extended via User Defined Functions (UDF) in Java, Python, etc.

  • Compute Mean for Each Group

    ID Group No. Score

    1 1 0.5

    2 3 1.0

    3 1 0.8

    4 2 0.7

    5 2 1.5

    6 3 1.2

    7 1 0.8

    8 2 0.9

    9 4 1.3

    … … …

  • A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID : int, groupNo: int, score: float);

    B = GROUP A BY groupNo;

    C = FOREACH B GENERATE group, AVG(A.score) AS mean;

    DUMP C;

    ID Group No. Score

    1 1 0.5

    2 3 1.0

    3 1 0.8

    4 2 0.7

    5 2 1.5

    6 3 1.2

    7 1 0.8

    8 2 0.9

    9 4 1.3

    … … …

    File Sample-1.dat

  • Load the Data into Pig

    • A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID : int, groupNo: int, score: float);– The path of the data on HDFS after LOAD

    • USING PigStorage() means delimit the data by tab (can be omitted)

    • If data are delimited by other characters, e.g. space, use USING PigStorage(' ')

    • Data schema defined after AS

    • Variable types: int, long, float, double, chararray, …

  • Tuple

    • A Tuple is a data record consisting of a sequence of "fields”

    • Each field is a piece of data of any type

    • E.g. Each row of A is a Tuple:

    {ID: int, groupNo: int, score: float}

  • Data Bag

    • A Data Bag is a set of tuples

    • You may think of it as a "table”

    • Pig does not require that the tuple field types match, or even that the tuples have the same

    number of fields

  • Filter Data

    • D = FILTER A BY score>0.1 AND score

  • Per Row Operations

    • E = FOREACH A GENERATE groupNo, score, score * score AS scoreSq;

    • FOREACH…GENERATE does independent per-row operations

    • Column “ID” is thrown away

    • A new column “ScoreSq” was generated by simply doing score * score

  • GROUP…BY…

    • B = GROUP A BY groupNo;

    • A: {ID: int, groupNo: int, score: float}

    • B: {group: int, A: {(ID: int, groupNo: int, score: float)}

    • Results of GROUP…BY… always have two fields– Field “group” saves the information of the group.

    Here, it is the same as groupNo.

    – The second field is a data bag named as the variable name after GROUP (e.g. A here), which consists of all the tuples (rows) for this group (e.g. with the same groupNo)

  • AVG (Average) Operation

    • B = GROUP A BY groupNo;

    • C = FOREACH B GENERATE group, AVG(A.score) AS mean;

    • B: {group: int, A: {(ID: int, groupNo: int, score: float)}

    • C: {group: int, mean: double}

    • AVG function computes the average of the numeric values in a single-column of data bag

    • A.score retrieves the “score” column in the field “A” of data B and forms a new data bag

  • Other Sample Functions on Data Bags

    • SUM: Computes the sum of the numeric values in a single-column bag.

    • COUNT: Counts the number of tuples in a data bag

    • MIN / MAX: Computes the min/max of the numeric values in a single-column bag.

    • IsEmpty: Checks if a data bag is empty

    • …

  • Pig Output

    • DUMP function outputs on screen

    • STORE outputs to HDFS storage

    • For example:

    – DUMP C;

    – STORE C INTO 'output/Output-1' USING PigStorage();

    – Pig stores the data C into the directory with path output/Output-1

    • STORE creates the directory if it doesn't exist

    – The job fails if the directory already exists

    – Use rmf before the STORE statement to force overwrite

  • Pig Demo 1

  • Problem: Number of students per {Grade, Gender}?

    Student ID Grade Gender Height (cm)

    1 3 M 120

    2 2 F 115

    3 2 M 116

    … … …

    File Sample-2.dat

    A = LOAD 'Sample-2.dat' USING PigStorage() AS (studentID: int, grade: int, gender: chararray, height: int);

    B = GROUP A BY (grade, gender);

    C = FOREACH B GENERATE FLATTEN(group) AS (grade, gender), COUNT(A) AS numStudents;

    DUMP C;

  • GROUP…BY…

    • B = GROUP A BY (grade, gender);

    • A: {studentID : int, grade: int, gender: chararray, height: int}

    • B: {group: (grade: int, gender: chararray), A: {(studentID: int, grade: int, gender: chararray, height: int)}}

    • The “group” field in B now becomes a Tuple with two fields: grade and gender

  • FLATTEN

    • FLATTEN has to be used with FOREACH…GENERATE…

    • FLATTEN operates on Tuples or Data Bags

    • FLATTEN(a field that is a Tuple) => a set of fields– E.g. A: {a: int, b: (c: int, d: chararray)}

    – B = FOREACH A GENERATE a, FLATTEN(b);

    – B: {a: int, b::c: int, b::d: chararray}

    – B = FOREACH A GENERATE a, FLATTEN(b) AS (c:int, d:chararray);

    – B: {a: int, c:int, d:chararray}

  • FLATTEN

    • FLATTEN(a DataBag) generates N records if the bag has N tuples

    • It cross products with the other fields in the data

    • For example– A: {a: int, b: {(c: int, d: chararray)}}

    – The field b is a Data Bag that contains Tuples with two fields: c and d

    – B = FOREACH A GENERATE a, FLATTEN(b) as (c:int, d:chararray);

    A

    (1, {(2,M), (5,F)})

    (3, {(4,F), (6,M), (7,F)})

    B

    (1, 2, M)

    (1, 5, F)

    (3, 4, F)

    (3, 6, M)

    (3, 7, F)

  • Pig Demo 2

  • JOIN

    A = LOAD 'Sample-3a.dat' AS (a1:int, a2:int, a3:int);

    DUMP A;

    (1,2,3)
    (4,2,1)
    (8,3,4)
    (4,3,3)
    (7,2,5)
    (8,4,3)

    B = LOAD 'Sample-3b.dat' AS (b1:int, b2:int);

    DUMP B;

    (2,4)
    (8,9)
    (1,3)
    (2,7)
    (2,9)
    (4,6)
    (4,9)

    X = JOIN A BY a1, B BY b1;

    DUMP X;

    (1,2,3,1,3)
    (4,2,1,4,6)
    (4,3,3,4,6)
    (4,2,1,4,9)
    (4,3,3,4,9)
    (8,3,4,8,9)
    (8,4,3,8,9)

    • Matching keys are combined as a cross-product (e.g., the two A rows with a1 = 4 pair with both B rows with b1 = 4)

    • Records that do not match (e.g., a1 = 7 in A, b1 = 2 in B) are thrown away


  • COGROUP

    A and B loaded as in the JOIN example above (Sample-3a.dat, Sample-3b.dat)

    W = COGROUP A BY a1, B BY b1;

    DUMP W;

    (1,{(1,2,3)},{(1,3)})
    (2,{},{(2,4),(2,7),(2,9)})
    (4,{(4,2,1),(4,3,3)},{(4,6),(4,9)})
    (7,{(7,2,5)},{})
    (8,{(8,3,4),(8,4,3)},{(8,9)})

    DESCRIBE W;

    W: {group: int, A: {(a1: int, a2: int, a3: int)}, B: {(b1: int, b2: int)}}

  • COGROUP and JOIN

    • X = JOIN A BY a1, B BY b1;

    • X: {A::a1: int, A::a2: int, A::a3: int, B::b1: int, B::b2: int}

    • W = COGROUP A BY a1, B BY b1;

    • X = FOREACH W GENERATE FLATTEN(A), FLATTEN(B);

    • X: {A::a1: int, A::a2: int, A::a3: int, B::b1: int, B::b2: int}

    • The two formulations are equivalent; COGROUP is more flexible


  • Pig Demo 3

  • Students at schools take tests (e.g. SAT)

    • One student can take the same test multiple times

    • We take Max(scores) from one student as his/her final score

    • Problem: The top 3 schools with the largest number of students having final score > 90?

    Student ID School Score

    1 B 89

    1 B 90

    3 B 75

    2 A 60

    4 C 90

    … … …

    File Sample-4.dat

  • Problem: The top 3 schools with the largest number of students having final score > 90?

    A = LOAD 'Sample-4.dat' AS (studentID: int, school: chararray, score: int);

    B = GROUP A BY school;

    C = FOREACH B {
        D = FILTER A BY score > 90;
        E = DISTINCT D.studentID;
        GENERATE FLATTEN(group) AS school, COUNT(E) AS count;
    }

    F = ORDER C BY count DESC;

    G = LIMIT F 3;

    DUMP G;

    • Nested block for FOREACH…GENERATE; the last statement in the nested block must be GENERATE.

    • B: {group: chararray, A: {(studentID: int, school: chararray, score: int)}}

    • For each record in B, A is a field that is a Data Bag storing the data for that group (school).

    • D filters the Data Bag A by score > 90 and generates a new Data Bag.

    • E takes the studentID field from D, forms a new Data Bag of single-field tuples (D.studentID), and performs a "unique" operation via DISTINCT.

    • ORDER…BY is the sort function, implemented in map-reduce.

    • DESC means sort in descending order; without DESC, the sort is ascending.

    • Q: What does this mean? H = ORDER A BY score DESC, studentID;

    • The LIMIT operation picks the top 3 records from all the records.

  • Pig Demo 4

  • User Defined Functions in Pig (UDF)

    • Supports customized computation in mapper/reducer stages

    • Best to write in Java

    • Also supports Python

  • Calling Java UDF in Pig

    • Example: Change the name field to upper case

    REGISTER myudfs.jar;                         -- register the jar file that contains the UDF Java class

    A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);

    B = FOREACH A GENERATE myudfs.UPPER(name);   -- call the UDF to change name to upper case

    DUMP B;


  • The UPPER UDF in Java

    package myudfs;

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.util.WrappedIOException;

    // Used in FOREACH…GENERATE context; the type parameter of EvalFunc is the output type (String)
    public class UPPER extends EvalFunc<String> {
        // The input is always a Tuple of the function's parameters;
        // here there is only one parameter: the string to be upper-cased
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) return null;
            try {
                String str = (String) input.get(0);
                return str.toUpperCase();   // return the upper-cased string
            } catch (Exception e) {
                throw WrappedIOException.wrap("Caught exception processing input row ", e);
            }
        }
    }


  • Summary of Pig

    • A really nice interactive tool for Hadoop MapReduce beginners

    • Efficient for quick statistical data analysis on big data (very little coding time)

    • UDF support enables comprehensive MapReduce jobs

  • A Deep Dive into Hadoop Map-Reduce

  • Hadoop Map-Reduce in Java

    • A natural way for implementing Map-Reduce since Hadoop is written in Java

    • A Map-Reduce job class contains at least– A Mapper class

    – A Reducer class

    – A main function that sets up the configs and runs the job

    • More info: http://hadoop.apache.org/docs/stable/mapred_tutorial.html


  • Word Count Through Map-Reduce

    (Same diagram as before: Mapper 1 reads "Hello World Bye World", Mapper 2 reads "Hello Hadoop Goodbye Hadoop"; Reducer 1 handles words A–G, Reducer 2 handles words H–Z.)

    • A large amount of data has to be transferred from the mappers to the reducers when the data are large

  • How About…

    (Diagram: the same word-count setup, leading into the combiner idea on the next slide.)

  • Combiners

    • A reducer-like class that runs in the mapper stage to aggregate data into sufficient statistics before sending them to the reducers

    • Can significantly improve Hadoop Map-Reduce performance

    • Pig: Automatically applies combiners

  • (Diagram: word count with combiners. Mapper 1 reads "Hello World Bye World" and Mapper 2 reads "Hello Hadoop Goodbye Hadoop"; each mapper's output passes through its own combiner (Combiner 1, Combiner 2) before going to Reducer 1 (words A–G) and Reducer 2 (words H–Z).)

  • Partitioners

    • Determines which <key, value> pair is sent to which reducer

    • Default: random partitioning based on key’s hash value

    • Can be overwritten according to customized needs

  • Number of Reducers

    • Too small

    – The job runs forever

    • Too big

    – Waiting time to obtain many reducer nodes and starting those nodes can be long

    – A set of tiny output files causes:

    • Inefficient Hadoop namespace usage

    • Jobs that read this output as input to use an unnecessarily large number of mappers

    • Ideally

    – Each reducer runs for at least several minutes, but not too long

    – Output partition size is at least a few hundred MB

    • In Pig– B = GROUP A BY name PARALLEL 10;

    – C = COGROUP A BY name, B BY name PARALLEL 10;

    – B = ORDER A BY NAME PARALLEL 10;

    – …

  • More Efficient Hadoop Jobs

    • Why combiners?– Use combiners if necessary to reduce the amount

    of data transferred from mappers to reducers

    • Why partitioners?– Use partitioners to make sure the amount of

    computation time for each reducer is similar

    • Why optimizing number of reducers?– Optimize Map-Reduce running time

    – Optimize namespace usage

  • More Efficient Pig Scripts

    • Remove unnecessary fields early and often using FOREACH…GENERATE

    • Filter early and often

    • COGROUP and JOIN: smaller files always on the left

    – A = join small by x, large by y;

    • Use PARALLEL to tune the parallelism of the job (Q: Why?)

  • Hadoop Streaming

    • A utility to create and run Map-Reduce jobs with any executable or script as the mapper/reducer

    • For example

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \

    -input myInputDirs \

    -output myOutputDir \

    -mapper myPythonScript.py \

    -reducer /bin/wc \

    -file myPythonScript.py

    • -file ships the Python script to the cluster during job submission


  • Running R using Hadoop Streaming

    • Install R on Hadoop gateway machines

    • Package R dir into R.jar by "jar cvf R.jar -C R/ .”

    • Copy R.jar from local to Hadoop

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \

    -input myInputDirs \

    -output myOutputDir \

    -mapper "myR/bin/R --vanilla --slave --no-restore -f mapper.R" \

    -reducer "myR/bin/R --vanilla --slave --no-restore -f reducer.R" \

    -file mapper.R \

    -file reducer.R \

    -cacheArchive 'hdfs://[nameNodeURL]:9000/[HDFS location]/R.jar#myR' \

    -cmdenv R_HOME_DIR=myR \

  • Running R Using Hadoop

    Streaming

    • cacheArchive ships R.jar to each mapper/reducer node, unzips the jar into the directory myR, and executes R from myR/bin/R

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \

    -input myInputDirs \

    -output myOutputDir \

    -mapper "myR/bin/R --vanilla --slave --no-restore -f mapper.R" \

    -reducer "myR/bin/R --vanilla --slave --no-restore -f reducer.R" \

    -file mapper.R \

    -file reducer.R \

    -cacheArchive 'hdfs://[nameNodeURL]:9000/[HDFS location]/R.jar#myR' \

    -cmdenv R_HOME_DIR=myR \

  • RHive

    • An R package that interacts with Hive (SQL-like in Hadoop) from R

    • Connects R objects and functions with Hive

    • http://cran.r-project.org/web/packages/RHive/RHive.pdf

  • RHIPE

    • Another R package that aims to interact with Hadoop HDFS and Map-Reduce

    • Sample R functions– rhput: Put a file to HDFS– rhget: Copy a file to local from HDFS– rhmr: Prepares a Map-Reduce job for execution– rhex: Execute a Map-Reduce job– rhkill: Kill a Map-Reduce job– …

    • http://www.datadr.org/getpack.html

  • Summary of Part I

    • Big Data Overview: Statistical perspective

    • Map-Reduce : An increasingly popular computing system to analyze big data

    • Hadoop System: An open source implementation of Map-Reduce

    • Pig: High level Hadoop Map-Reduce language (like R for big data)

  • Structure of This Tutorial

    • Part I: Introduction to Map-Reduce and the Hadoop System

    – Overview of Distributed Computing

    – Introduction to Map-Reduce

    – Introduction to the Hadoop System

    – The Pig Language

    – A Deep Dive of Hadoop Map-Reduce

    • Part II: Examples of Statistical Computing for Big Data

    – Bag of Little Bootstraps

    – Large Scale Logistic Regression

    – Parallel Matrix Factorization

    • Part III: The Future of Cloud Computing

  • Bag of Little Bootstraps

    Kleiner et al. 2012

  • Bootstrap (Efron, 1979)

    • A re-sampling based method to obtain statistical distribution of sample estimators

    • Why are we interested?

    – Re-sampling is embarrassingly parallelizable

    • For example: standard deviation of the mean of N samples (μ)

    – For i = 1 to r do

    • Randomly sample with replacement N times from the original sample -> bootstrap data i

    • Compute the mean of the i-th bootstrap data -> μi

    – Estimate of Sd(μ) = Sd([μ1, …, μr])

    – r is usually a large number, e.g. 200 (a minimal R sketch follows below)
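    As referenced above, a minimal single-machine R sketch of the naive bootstrap for Sd(μ); the r resamples are exactly the pieces that would be farmed out to parallel nodes.

    # Naive bootstrap estimate of the standard deviation of the mean
    boot_sd_of_mean <- function(x, r = 200) {
      mu <- replicate(r, mean(sample(x, length(x), replace = TRUE)))  # r bootstrap means
      sd(mu)                                                          # Sd([mu_1, ..., mu_r])
    }

    # Example: boot_sd_of_mean(rnorm(1e4))  # should be close to 1/sqrt(1e4) = 0.01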

  • Bootstrap for Big Data

    • Can have r nodes running in parallel, each sampling one bootstrap data

    • However…

    – N can be very large

    – Data may not fit into memory

    – Collecting N samples with replacement on each node can be computationally expensive

  • M out of N Bootstrap (Bickel et al. 1997)

    • Obtain SdM(μ) by sampling M samples with replacement for each bootstrap, where M < N (often much smaller)

  • Bag of Little Bootstraps (BLB)

    • Example: standard deviation of the mean

    • Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)

    • For each data set p = 1 to S do

    – For i = 1 to r do

    • Draw N samples with replacement from the data of size b

    • Compute the mean of the resampled data -> μpi

    – Compute Sdp(μ) = Sd([μp1, …, μpr])

    • Estimate of Sd(μ) = Avg([Sd1(μ), …, SdS(μ)]) (see the R sketch below)
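    A minimal single-machine R sketch of the BLB procedure above for Sd(μ); in practice each subset p would be one mapper/reducer task, and the function and argument names here are illustrative, not from the paper.

    blb_sd_of_mean <- function(x, gamma = 0.7, S = 5, r = 100) {
      N <- length(x)
      b <- ceiling(N^gamma)                  # subset size b = N^gamma
      subset_sds <- numeric(S)
      for (p in 1:S) {
        xb <- sample(x, b, replace = FALSE)  # subsample of size b, without replacement
        mu <- numeric(r)
        for (i in 1:r) {
          # Equivalent to N draws with replacement from xb:
          # multinomial counts of dimension b, so storage and cost are O(b)
          w <- as.vector(rmultinom(1, size = N, prob = rep(1 / b, b)))
          mu[i] <- sum(w * xb) / N           # weighted mean of the resample
        }
        subset_sds[p] <- sd(mu)              # Sd_p(mu)
      }
      mean(subset_sds)                       # average over the S subsets
    }

    # Example: blb_sd_of_mean(rnorm(1e5))  # should be close to 1/sqrt(1e5)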


  • Bag of Little Bootstraps (BLB)

    • Interest: ξ(θ), where θ is an estimate obtained from size N data– ξ is some function of θ, such as standard deviation, …

    • Generate S sampled data sets, each obtained from random sampling without replacement a subset of size b (or partition the original data into S partitions, each with size b)

    • For each data p = 1 to S do– For i = 1 to r do

    • Sample N samples with replacement on the data of size b• Compute the estimate of the resampled data -> θpi

    – Compute ξp(θ) = ξ([θp1,…θpr])

    • Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])

    (On Hadoop: the per-subset resampling and estimation run as Mapper/Reducer tasks; the Gateway averages the S results.)

    ***Typo in hand out!

  • Why is BLB Efficient

    • Before:

    – N samples with replacement from size-N data is expensive when N is large

    • Now:

    – N samples with replacement from size-b data

    – b can be several orders of magnitude smaller than N (e.g. b = Nγ, γ in [0.5, 1))

    – Equivalent to: a multinomial sampler with dim = b

    – Storage = O(b), computational complexity = O(b)

  • Simulation Experiment

    • 95% CI of Logistic Regression Coefficients

    • N = 20000, 10 explanatory variables

    • Relative Error = |Estimated CI width – True CI width | / True CI width

    • BLB-γ: BLB with b = Nγ

    • BOFN-γ: b out of N sampling with b = Nγ

    • BOOT: Naïve bootstrap

  • Simulation Experiment

  • Real Data

    • 95% CI of Logistic Regression Coefficients

    • N = 6M, 3000 explanatory variables

    • Data size = 150GB, r = 50, S = 5, γ = 0.7

    (Figures: results with data stored on disk vs. data stored in memory using Spark)

  • Hyper-parameter Selection

    • From empirical experiments in the paper

    – S>=3 and r>=50 is sufficient for low relative errors

    • Adaptively selecting r and S

    – Increase r and s until estimated value converges

    – For r: Continue to process resamples and update ξp(θ) until it has ceased to change significantly.

    – For S: Continue to process more subsamples (i.e., increasing S) until BLB’s output value has stabilized.

    ***New Slide!

  • Hyper-parameter Selection

    (Figure: BLB-0.7 vs. adaptive BLB)

    • Adaptively selecting r and S

    – Increase r and S until the estimated value converges

  • Summary of BLB

    • A new algorithm for bootstrapping on big data

    • Advantages

    – Fast and efficient

    – Easy to parallelize

    – Easy to understand and implement

    – Friendly to Hadoop, makes it routine to perform

    statistical calculations on Big data

  • Large Scale Logistic Regression

    Deepak Agarwal, Bee-Chung Chen, Bo Long,

    Liang Zhang, Xianxing Zhang

    Applied Relevance Science at LinkedIn

  • Logistic Regression

    • Binary response: Y

    • Covariates: X

    • Yi ~ Bernoulli(pi)

    • log(pi / (1 − pi)) = Xiᵀβ; β ~ MVN(0, (1/λ) I)

    • Widely used (research and applications)

  • Response Prediction: Application of Logistic Regression in Recommender Systems

    • User i, with user features (e.g., industry, behavioral features, demographic features, …), visits

    • The algorithm selects item j from a set of candidates

    • (i, j): response yij (click or not)

    • Which item should we recommend to the user?

    – The item with the highest predicted response rate

    • Logistic regression is an effective technique for this

  • Examples of Recommender Systems

  • Similar Problem: Content Recommendation on the Yahoo! Front Page (Today Module)

    • Recommend content links (out of 30-40, editorially programmed)

    • 4 slots exposed (F1–F4); F1 has maximum exposure

    • Routes traffic to other Y! properties

  • Right Media Ad Exchange: Unified Marketplace

    • Match ads to page views on publisher sites

    (Diagram: a publisher with an ad impression to sell runs an auction; advertisers such as Ad.com and AdSense submit bids, e.g. $0.50, $0.60, and $0.75 via a network, which becomes a $0.45 bid; the $0.65 bid wins.)

  • Logistic Regression for Response Prediction

    • Binary response: Y

    – Click / non-click

    • Covariates: X

    – User covariates: age, gender, industry, education, job, job title, …

    – Item covariates: categories, keywords, topics, …

    – Context covariates: time, page type, position, …

    – 2-way interactions: user covariates × item covariates; context covariates × item covariates; …

  • Computational Challenge

    • Hundreds of millions/billions of observations

    • Hundreds of thousands/millions of covariates

    • Fitting such a logistic regression model on a single machine not feasible

    • Model fitting is iterative, using methods like gradient descent, Newton's method, etc.

    – Multiple passes over the data

  • Recap on Optimization method

    • Problem: Find x to min(F(x))

    • Iteration n: x_n = x_{n−1} − b_{n−1} F′(x_{n−1})

    • b_{n−1} is the step size, which can change every iteration

    • Iterate until convergence

    • Conjugate gradient, LBFGS, Newton trust region, … are all of this kind (a minimal R sketch for logistic regression follows below)
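    For concreteness, a minimal R sketch of gradient descent for an unregularized logistic regression; the fixed step size b and the iteration count are illustrative choices, not taken from the slides.

    # Gradient descent on the average negative log-likelihood F(beta)
    logit_gd <- function(X, y, b = 0.1, iters = 100) {
      beta <- rep(0, ncol(X))
      for (n in seq_len(iters)) {
        p    <- 1 / (1 + exp(-X %*% beta))   # predicted click probabilities
        grad <- t(X) %*% (p - y) / nrow(X)   # F'(beta_{n-1})
        beta <- beta - b * grad              # x_n = x_{n-1} - b_{n-1} F'(x_{n-1})
      }
      drop(beta)
    }

    # Each iteration needs one full pass over the data, which is why a Hadoop
    # implementation costs one Map-Reduce job per iteration.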

  • Iterative Process with Hadoop

    (Diagram: in every iteration, data are read from disk into mappers, mapper output is written to disk and read by reducers, and reducer output is written back to disk.)

  • Limitations of Hadoop for fitting a big logistic regression

    • Iterative process is expensive and slow

    • Every iteration = a Map-Reduce job

    • I/O of mapper and reducers are both through disk

    • Plus: Waiting in queue time

    • Q: Can we find a fitting method that scales with Hadoop ?

  • Large Scale Logistic Regression

    • Naïve: – Partition the data and run logistic regression for each partition

    – Take the mean of the learned coefficients

    – Problem: Not guaranteed to converge to the model from single machine!

    • Alternating Direction Method of Multipliers (ADMM)– Boyd et al. 2011

    – Set up constraints: each partition’s coefficient = global consensus

    – Solve the optimization problem using Lagrange Multipliers

    – Advantage: guaranteed to converge to a single machine logistic regression on the entire data with reasonable number of iterations

  • Large Scale Logistic Regression via ADMM

    (Diagram, iterations 1, 2, …: BIG DATA is split into Partition 1 … Partition K; a logistic regression is fit on each partition; a consensus computation combines the per-partition coefficients; the consensus is sent back to the partitions and the process repeats.)

  • Details of ADMM

  • Dual Ascent Method

    • Consider a convex optimization problem

    • Lagrangian for the problem:

    • Dual Ascent (a standard statement is sketched below):
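    A standard statement of dual ascent, following Boyd et al. (2011), assuming the convex problem min_x f(x) subject to Ax = b; the notation is assumed, not taken from the slide:

    \begin{aligned}
    L(x, y) &= f(x) + y^{\top}(Ax - b) && \text{(Lagrangian)}\\
    x^{k+1} &= \operatorname*{arg\,min}_x \; L(x, y^{k}) && \text{(primal minimization)}\\
    y^{k+1} &= y^{k} + \alpha^{k}\,(A x^{k+1} - b) && \text{(dual ascent step, step size } \alpha^{k}\text{)}
    \end{aligned}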

  • Augmented Lagrangians

    • Bring robustness to the dual ascent method

    • Yield convergence without assumptions like strict convexity or finiteness of f

    • The value of ρ influences the convergence rate

  • Alternating Direction Method of

    Multipliers (ADMM)

    • Problem:

    • Augmented Lagrangians

    • ADMM (sketched below):
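    A hedged restatement of standard ADMM (Boyd et al. 2011) for min f(x) + g(z) subject to Ax + Bz = c, with augmented Lagrangian parameter ρ > 0:

    \begin{aligned}
    L_{\rho}(x, z, y) &= f(x) + g(z) + y^{\top}(Ax + Bz - c) + \tfrac{\rho}{2}\,\|Ax + Bz - c\|_2^2\\
    x^{k+1} &= \operatorname*{arg\,min}_x \; L_{\rho}(x, z^{k}, y^{k})\\
    z^{k+1} &= \operatorname*{arg\,min}_z \; L_{\rho}(x^{k+1}, z, y^{k})\\
    y^{k+1} &= y^{k} + \rho\,(A x^{k+1} + B z^{k+1} - c)
    \end{aligned}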

  • Large Scale Logistic Regression via ADMM

    • Notation

    – (Xi , yi): data in the ith partition

    – βi: coefficient vector for partition i

    – β: Consensus coefficient vector

    – r(β): penalty component such as ||β||22

    • Optimization problem: minimize the sum of the per-partition logistic losses li(βi) plus the penalty r(β), subject to βi = β for every partition i

  • ADMM updates

    (The update equations, shown as a figure: local regressions with shrinkage towards the current best global estimate, followed by an updated consensus; a sketch follows below.)
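    A sketch of the consensus-form updates for this problem, following Boyd et al. (2011); the scaled dual variables ui are assumed notation, and li denotes the logistic loss on partition i:

    \begin{aligned}
    \beta_i^{k+1} &= \operatorname*{arg\,min}_{\beta_i} \Big( l_i(\beta_i) + \tfrac{\rho}{2}\,\|\beta_i - \bar{\beta}^{k} + u_i^{k}\|_2^2 \Big)
      && \text{(local regressions, shrunk towards the consensus)}\\
    \bar{\beta}^{k+1} &= \operatorname*{arg\,min}_{\beta} \Big( r(\beta) + \tfrac{K\rho}{2}\,\big\|\beta - \tfrac{1}{K}\textstyle\sum_{i}(\beta_i^{k+1} + u_i^{k})\big\|_2^2 \Big)
      && \text{(updated consensus)}\\
    u_i^{k+1} &= u_i^{k} + \beta_i^{k+1} - \bar{\beta}^{k+1}
      && \text{(dual update)}
    \end{aligned}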

  • An example implementation

    • ADMM for Logistic regression model fitting with L2/L1 penalty

    • Each iteration of ADMM is a Map-Reduce job– Mapper: partition the data into K partitions

    – Reducer: For each partition, use liblinear/glmnet to fit a L1/L2 logistic regression

    – Gateway: consensus computation by results from all reducers, and sends back the consensus to each reducer node

  • KDD CUP 2010 Data

    • Bridge to Algebra 2008-2009 data from https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp

    • Binary response, 20M covariates

    • Only keep covariates with >= 10 occurrences => 2.2M covariates

    • Training data: 8,407,752 samples

    • Test data : 510,302 samples

  • Avg Training Log-likelihood vs Number of Iterations

  • Test AUC vs Number of Iterations

  • Better Convergence Can Be Achieved By

    • Better initialization

    – Use results from the naïve method to initialize the parameters

    • Adaptively changing the step size (ρ) for each iteration based on the convergence status of the consensus

  • Still…

    • ADMM in hadoop can take hours to converge

    • Is there a better way to handle iterative learning process in Hadoop?

  • Parallel Matrix Factorization

    Deepak Agarwal, Bee-Chung Chen,

    Rajiv Khanna, Liang Zhang

    Applied Relevance Science at LinkedIn

  • Personalized Webpage Is Everywhere

  • Personalized Webpage Is Everywhere

  • Common Properties of Web Personalization Problems

    • One or multiple metrics to optimize

    – Click Through Rate (CTR) (focus of this talk)

    – Revenue per impression

    – Time spent on the landing page

    – Ad conversion rate

    – …

    • Large scale data– MapReduce to solve the problem!

    • Sparsity

    • Cold-start– User features: Age, gender, position, industry, …

    – Item features: Category, key words, creator features, …

  • Problem Setup

    • CTR prediction for a user on an item

    • Assumptions: – There are sufficient data per item to estimate per-item model

    – Serving bias and positional bias are removed by randomly serving scheme

    – Item popularities are quite dynamic and have to be estimated in real-time fashion

    • Examples:– Yahoo! Front page Today module

    – Linkedin Today module

  • Online Logistic Regression (OLR)

    • User i with feature vector xi, article j

    • Binary response y (click/non-click)

    • Prior on the per-article coefficient vector βj

    • Use a Laplace approximation or variational Bayesian methods to obtain the posterior

    • The posterior becomes the new prior for the next batch of data

    • The prior and posterior covariances can be approximated as diagonal for high-dimensional xi

    (A sketch of the update cycle follows below.)
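    A hedged sketch of the OLR update cycle described above, in the spirit of Agarwal, Chen, and Elango (2010); the notation (μj, Σj) is an assumption, not copied from the slides:

    \begin{aligned}
    y &\sim \mathrm{Bernoulli}(p_{ij}), \qquad \log\tfrac{p_{ij}}{1 - p_{ij}} = x_i^{\top}\beta_j\\
    \text{prior: } & \beta_j \sim N(\mu_j, \Sigma_j)\\
    \text{posterior: } & p(\beta_j \mid \text{new batch of data}) \approx N(\tilde{\mu}_j, \tilde{\Sigma}_j) \quad \text{(Laplace or variational approximation)}\\
    \text{new prior: } & (\mu_j, \Sigma_j) \leftarrow (\tilde{\mu}_j, \tilde{\Sigma}_j), \quad \text{with } \Sigma_j \text{ taken diagonal for high-dimensional } x_i
    \end{aligned}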

  • User Covariates for OLR

    • Age, gender, industry, job position for login users

    • General behavior targeting (BT) covariates– Music? Finance? Politics?

    • User profiles from historical view/click behavior on previous items in the data, e.g.– Item-profile: use previously clicked item ids as the user profile

    – Category-profile: use item category affinity score as profile. The score can be simply user’s historical CTR on each category.

    – Are there better ways to generate user profiles?

    – Yes! By matrix factorization!

  • Generalized Matrix Factorization (GMF) Framework

    (Diagram, following Bell et al. 2007: the predicted score combines global features, a user effect, an item effect, and the inner product of user factors and item factors.)

  • Regression Priors

    • g(·), h(·), G(·), H(·) can be any regression functions

    • Agarwal and Chen (KDD 2009); Zhang et al. (RecSys 2011)

    (g and G are regressions on user covariates; h and H are regressions on item covariates; a sketch of the full specification follows below.)
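    A hedged sketch of the GMF score and its regression priors, consistent with Agarwal and Chen (2009) and Zhang et al. (2011); the exact notation is assumed:

    \begin{aligned}
    \text{score}_{ij} &= f(x_{ij}) + \alpha_i + \beta_j + u_i^{\top} v_j
      && \text{(global features + user effect + item effect + factors)}\\
    \alpha_i &\sim N\big(g(x_i), \sigma_\alpha^2\big), \qquad \beta_j \sim N\big(h(x_j), \sigma_\beta^2\big)\\
    u_i &\sim N\big(G(x_i), \sigma_u^2 I\big), \qquad\; v_j \sim N\big(H(x_j), \sigma_v^2 I\big)
    \end{aligned}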

  • Different Types of Prior Regression Models

    • Zero prior mean

    – Bilinear random effects (BIRE)

    • Linear regression

    – Simple regression (RLFM)

    – Lasso penalty (LASSO)

    • Tree models

    – Recursive partitioning (RP)

    – Random forests (RF)

    – Gradient boosting machines (GB)

    – Bayesian additive regression trees (BART)

  • Several Model Fitting Approaches

    • Gibbs Sampling

    • Stochastic Gradient Descent

    • Monte Carlo Expectation-Maximization (MCEM)

    • For now: Single machine only (will discuss Parallel

    Matrix Factorization later)

  • Gibbs Sampling

    • Put additional priors on f(·), g(·), h(·), G(·), H(·), σα, σβ, σu, σv

    • For each iteration, sample full conditional posteriors of α, β, u, v, f, g, h, G, H, σα, σβ, σu, σv

    • Need plenty of iterations to converge and obtain reasonable posterior mean estimates of these parameters

    • When data is large, not feasible for single machine

    • Iterative property of Gibbs Sampling makes parallelization on Hadoop not feasible

  • Stochastic Gradient Descent (SGD)

    • A popular model fitting approach for matrix factorization since the Netflix competition

    • U and V are unknown coefficient matrices that map the cold-start covariates xi and xj to the low-dimensional latent space

    • Loss function L = Σk Lk, a sum of per-observation losses

  • Stochastic Gradient Descent (SGD)

    • Assume data have N samples

    • Loss function L = Σk Lk

    • For k = 1, 2, …, N do: take a gradient step on Lk (a sketch follows below)

    • Can run multiple passes of data to achieve convergence

    • ρk is the step size for each observation k
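    A sketch of the per-observation gradient step; since the exact loss Lk is not restated here, only the generic form is shown, with ρk the step size and observation k involving user i_k and item j_k:

    \begin{aligned}
    u_{i_k} &\leftarrow u_{i_k} - \rho_k\, \partial L_k / \partial u_{i_k}, \qquad
    v_{j_k} \leftarrow v_{j_k} - \rho_k\, \partial L_k / \partial v_{j_k}
    \end{aligned}

    with analogous steps for the cold-start maps U and V.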

  • Our Approach: MCEM

    • Monte Carlo EM (Booth and Hobert 1999)

    • E Step: obtain N samples of the conditional posterior of the random effects given the current parameter estimates

    • M Step: update the parameter estimates (the regression functions and variance components) using the posterior samples

  • Handling Binary Responses

    • Gaussian responses: the conditional posteriors have closed form

    • Binary responses + Logistic: no longer closed form

    • Variational approximation (VAR)

    • Adaptive rejection sampling (ARS)

  • Variational Approximation (VAR)

    • Initially let ξij = 1

    • Before each E-step, create pseudo Gaussian response for each binary observation

    • Run E-Step and M-Step using the Gaussian pseudo response

    • After M-step, let

  • Adaptive Rejection Sampling (ARS)

    • For each E-step, obtain precise conditional posterior samples of the random effects (α, β, u, v)

  • Simulation Study

    • 10 simulated data sets, 100K samples for both training and test

    • 1000 users and 1000 items in training

    • Extra 500 new users and 500 new items in test + old users/items

    • For each user/item, 200 covariates, only 10 useful

    • Construct non-linear regression model from 20 Gaussian functions for simulating α, β, u and v following Friedman (2001)

  • MovieLens 1M Data Set

    • 1M ratings with scale 1-5

    • 6040 users

    • 3706 movies

    • Sort by time, first 75% training, last 25% test

    • A lot of new users in the test data set

    • User covariates: Age, gender, occupation, zip code

    • Item covariates: Movie genre

  • Performance Comparison

  • However…

    • We are working with very large scale data sets!

    • Parallel matrix factorization methods using Map-Reduce have to be developed!

    • Khanna et al. 2012 Technical report

  • Model Fitting Using MCEM

    • Monte Carlo EM (Booth and Hobert 1999)

    • E Step: obtain N samples of the conditional posterior of the random effects given the current parameter estimates

    • M Step: update the parameter estimates using the posterior samples

  • Parallel Matrix Factorization

    • Partition the data into m partitions

    • For each partition, run the MCEM algorithm and get the parameter estimates

    • Ensemble runs: for k = 1, …, n

    – Repartition the data into m partitions with a new seed

    – Run an E-step-only job for each partition, given the estimated parameters

    • Average the user/item factors over all partitions and all k to obtain the final estimate

  • Parallel Matrix Factorization on Hadoop

    • The initial per-partition MCEM fitting is one Map-Reduce job

    • Each ensemble run is a separate Map-Reduce job

  • Key Points

    • Partitioning is tricky!– By events? By items? By users?

    • Empirically, "divide and conquer" plus averaging the per-partition estimates works well!

    • Ensemble runs: after the parameters are obtained, we run n E-step-only jobs and take the average, each job using a different user-item mix.

  • Identifiability Issues (MCEM-ARSID)

    • The same log-likelihood can be achieved under several transformations; each needs a constraint:

    – g(·) → g(·) + r, h(·) → h(·) − r: center α, β, u to zero mean in every E-step

    – u → −u, v → −v: constrain v to be positive

    – Switching (u.1, v.1) with (u.2, v.2): set ui ~ N(G(xi), I), vj ~ N(H(xj), λI) with the diagonal entries constrained so that λ1 >= λ2 >= …

  • MovieLens 1M Data

    • 75% training and 25% test split by time

    • Imbalanced data– User rating = 1: Positive

    – User rating = 2, 3, 4, 5: Negative

    – 5% positive rate

    • Balanced data– User rating = 1, 2, 3: Positive

    – User rating = 4, 5: Negative

    – 44% positive rate

  • Big difference between VAR and ARS for imbalanced data!

  • Matrix Factorization For User Profile

    • Offline user-profile-building period: obtain the user factor ui for user i

    • Online modeling using OLR

    – If a user has a profile (warm start), use ui as the user covariates

    – If not (cold start), use the prior regression prediction G(xi) as the user covariates

  • Offline Evaluation Metric Related to Clicks

    • For model M and J live items (articles) at any time, compute a score S(M) from the randomly served data

    • If M is a random (constant) model, E[S(M)] = #clicks

    • Unbiased estimate of the expected total clicks (Langford et al. 2008)

  • Experiments on Big Data

    • Yahoo! Front Page Today Module data

    • Data for building user profile: 8M users with at least 10 clicks (heavy users) in June 2011, 1B events

    • Data for training and testing OLR model: Random served data with 2.4M clicks in July 2011

    • Heavy users contributed around 30% of clicks

    • User covariates / features for OLR:– Intercept-only (MOST POPULAR)

    – 124 Behavior targeting features (BT-ONLY)

    – BT + top 1000 clicked article ids (ITEM-PROFILE)

    – BT + user profile with CTR on 43 binary content categories (CATEGORY-PROFILE)

    – BT + profiles from matrix factorization models

  • Click Lift Performance for Different User Profiles

    Warm Start: Users with at least one sample in training data

    Cold Start: Users with no data in training data

  • Structure of This Tutorial

    • Part I: Introduction to Map-Reduce and the Hadoop System

    – Overview of Distributed Computing

    – Introduction to Map-Reduce

    – Introduction to the Hadoop System

    – The Pig Language

    – A Deep Dive of Hadoop Map-Reduce

    • Part II: Examples of Statistical Computing for Big Data

    – Bag of Little Bootstraps

    – Large Scale Logistic Regression

    – Parallel Matrix Factorization

    • Part III: The Future of Cloud Computing

  • Spark

    • An open source cluster computing system that works with Hadoop HDFS developed in UC Berkeley AMPLab

    • In-memory cluster computing

    • Better than Hadoop for iterative algorithms and interactive data mining

    • Can be 100x faster than Hadoop Map-Reduce for some tasks

    • Code in Scala – easy to write

    • http://spark-project.org/

  • Logistic Regression in Spark vs Hadoop

  • Gradient Descent for Logistic Regression

  • Iterative Process in Hadoop

    (Diagram: in every iteration, data are read from disk into mappers, mapper output is written to disk and read by reducers, and reducer output is written back to disk.)

  • Iterative Process in Spark (Gradient Descent)

    (Diagram: data are read from disk into memory once; in each iteration mappers compute gradients and reducers aggregate them, entirely in memory.)

  • GraphLab

    • An open-source graph-based, high performance, distributed computation framework in C++

    • http://graphlab.org/

    • HDFS integration

    • Major design– Sparse data with local dependencies

    – Iterative algorithms

    – Potentially asynchronous execution among nodes

  • GraphLab

    • Graph-parallel

    • Map-Reduce: computation applied to independent records

    • GraphLab: dependent records stored as vertices in a large distributed data-graph

    • Computation in parallel on each vertex and can interact with neighboring vertices

  • Example: PageRank for Web Pages

    • Interest: probability of landing on a page by random clicking

    ***New Slide!

  • Example: PageRank

    • R[i] = Stationary probability of Node i

    • 1 − α = probability that a person stops clicking at any page (a sketch of the standard update follows below)

    ***New Slide!
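    A hedged statement of the standard PageRank recursion, using the document's α (so 1 − α is the stopping probability) and N pages; normalization conventions vary:

    R[i] \;=\; \frac{1 - \alpha}{N} \;+\; \alpha \sum_{j \to i} \frac{R[j]}{\mathrm{outdegree}(j)}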

  • Example: PageRank

  • Good For…

    • Bilinear random effect models (matrix factorization in collaborative filtering)

    • Clustering

    • Graphical models

    • Topic modeling

    • Graph analytics

    • …

  • GraphX

    • Combines the advantages of both data-parallel and graph-parallel systems

    • Distributed graph computation on Spark

  • Bibliography

    Agarwal, D. and Chen, B. (2009). Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 19–28. ACM.

    Agarwal, D., Chen, B., and Elango, P. (2010). Fast online learning through offline initialization for time-sensitive recommendation. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 703–712. ACM.

    Bell, R., Koren, Y., and Volinsky, C. (2007). Modeling relationships at multiple scales to improve accuracy of large recommender systems. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 95–104. ACM.

    Bickel, P. J., Götze, F., and van Zwet, W. R. (2012). Resampling fewer than n observations: gains, losses, and remedies for losses. Springer New York, 267–297.

    Booth, J. G., and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(1), 265–285.

    Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.

    Dean, J., and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.

    Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1–26.

    Khanna, R., Zhang, L., Agarwal, D., and Chen, B. (2012). Parallel matrix factorization for binary response. arXiv.org.

    Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. (2012). The big data bootstrap. arXiv preprint arXiv:1206.6415.

    Zhang, L., Agarwal, D., and Chen, B. (2011). Generalizing matrix factorization through flexible regression priors. In Proceedings of the fifth ACM conference on Recommender systems, 13–20. ACM.