Statistical Computing
For Big Data
Deepak Agarwal, Liang Zhang
LinkedIn Applied Relevance Science
JSM 2013, Montreal, Canada
Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop System
– Overview of Distributed Computing
– Introduction to Map-Reduce
– Introduction to the Hadoop System
– The Pig Language
– A Deep Dive of Hadoop Map-Reduce
• Part II: Examples of Statistical Computing for Big Data
– Bag of Little Bootstraps
– Large Scale Logistic Regression
– Parallel Matrix Factorization
• Part III: The Future of Cloud Computing
Big Data becoming Ubiquitous
• Bioinformatics
• Astronomy
• Internet
• Telecommunications
• Climatology
• …
Big Data: Some size estimates
• 1000 human genomes: > 100TB of data (1000 genomes project)
• Sloan Digital Sky Survey: 200GB data per night (>140TB aggregated)
• Facebook: A billion monthly active users
• LinkedIn: 225M members worldwide
• Twitter: 500 million tweets a day
• Over 6 billion mobile phones in the world generating data everyday
Big Data: Paradigm shift
• Classical Statistics
– Generalize using small data
• Paradigm Shift with Big Data
– We now have an almost infinite supply of data
– Easy statistics? Just appeal to asymptotic theory?
• So the issue is mostly computational?
– Not quite
• More data comes with more heterogeneity
• Need to change our statistical thinking to adapt
– Classical statistics is still invaluable for thinking about big data analytics
Some Statistical Challenges
• Exploratory Data Analysis (EDA), Visualization
– Retrospective (on terabytes)
– More real time (streaming computations every few minutes/hours)
• Statistical Modeling
– Scale (computational challenge)
– Curse of dimensionality: millions of predictors, heterogeneity
– Temporal and spatial correlations
Statistical Challenges continued
• Experiments
– To test new methods, test hypotheses via randomized experiments
– Adaptive experiments
• Forecasting
– Planning, advertising
• Many more that we are not yet fully versed in
Defining Big Data
• How to know you have the big data problem?
– Is it only the number of terabytes?
– What about dimensionality, structured/unstructured data, computations required, …?
• No clear definition; let's make up one
– When the desired computation cannot be completed in the stipulated time with the current best algorithm using the cores available on a commodity PC
– Agree? Other suggestions?
Distributed Computing for Big Data
• Distributed computing is an invaluable tool for scaling computations to big data
• Some distributed computing models
– Multi-threading
– Graphics Processing Units (GPU)
– Message Passing Interface (MPI)
– Map-Reduce
Evaluating a method for a problem
• Scalability
– Process X GB in Y hours
• Ease of use for a statistician
• Reliability (fault tolerance)
• Cost
– Hardware and cost of maintenance
• Good for the computations required?
– E.g., iterative versus one-pass
• Resource sharing
Multithreading
• Multiple threads take advantage of multiple CPUs
• Shared memory
• Threads can execute independently and concurrently
• Can only handle Gigabytes of data
• Reliable
Graphics Processing Units (GPU)
• Number of cores:
– CPU: order of 10
– GPU: order of 1000 (smaller cores)
• Can be >100x faster than CPU
– Parallel, computationally intensive tasks off-loaded to GPU
• Good for certain computationally-intensive tasks
• Can only handle Gigabytes of data
• Not trivial to use, requires good understanding of low-level architecture for efficient use
Message Passing Interface (MPI)
• Language independent communication protocol among processes (e.g. computers)
• Most suitable for master/slave model
• Can handle Terabytes of data
• Good for iterative processing
• Fault tolerance is low
Map-Reduce (Dean & Ghemawat, 2004)

[Diagram: Data → Mappers → Reducers → Output]

• Computation split into Map (scatter) and Reduce (gather) stages
• Easy to use:
– User needs to implement two functions: Mapper and Reducer
• Easily handles terabytes of data
• Very good fault tolerance (failed tasks automatically get restarted)
Comparison of Distributed Computing Methods

                              Multithreading   GPU              MPI         Map-Reduce
Scalability (data size)       Gigabytes        Gigabytes        Terabytes   Terabytes
Fault tolerance               High             High             Low         High
Maintenance cost              Low              Medium           Medium      Medium-High
Iterative process complexity  Cheap            Cheap            Cheap       Usually expensive
Resource sharing              Hard             Hard             Easy        Easy
Easy to implement?            Easy             Needs low-level  Easy        Easy
                                               GPU knowledge
Example Problem
• Tabulating word counts in corpus of documents
• Similar to table function in R
• Single machine: go through each word in each document and count; however
– Documents are stored on different nodes!
– Takes forever for a big corpus
• MPI: each slave node takes a subset of documents; the master node summarizes the results from the slaves; however
– Both the master node and the slave nodes may fail!
Word Count Through Map-Reduce
[Diagram: Mapper 1 reads "Hello World / Bye World", Mapper 2 reads "Hello Hadoop / Goodbye Hadoop"; Reducer 1 handles words from A-G, Reducer 2 handles words from H-Z]
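The word-count flow above can be simulated in a few lines of Python. This is an illustrative sketch, not Hadoop code; the A-G/H-Z routing rule mirrors the two reducers in the diagram:

```python
from collections import defaultdict

def mapper(document):
    # Emit a <word, 1> pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs, num_reducers=2):
    # Route each key to a reducer: words A-G to reducer 0, H-Z to reducer 1.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[0 if key[0].upper() <= 'G' else 1][key].append(value)
    return buckets

def reducer(key, values):
    # Sum the 1's emitted for this word.
    return key, sum(values)

docs = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
mapped = [p for d in docs for p in mapper(d)]
counts = dict(reducer(k, v) for bucket in shuffle(mapped) for k, v in bucket.items())
print(counts)  # {'Bye': 1, 'Goodbye': 1, 'Hello': 2, 'World': 2, 'Hadoop': 2}
```

In real Hadoop, only `mapper` and `reducer` are user code; the shuffle is done by the framework.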
Key Ideas about Map-Reduce
[Diagram: Big Data → Partition 1 … Partition N → Mapper 1 … Mapper N → Reducer 1 … Reducer M → Output 1 … Output M]
Key Ideas about Map-Reduce
• Data are split into partitions and stored in many different machines on disk (distributed storage)
• Mappers process data chunks independently and emit <key, value> pairs
• Data with the same key are sent to the same reducer. One reducer can receive multiple keys
• Every reducer sorts its data by key
• For each key, the reducer processes the corresponding values with the customized reducer function and outputs the result
Compute Mean for Each Group
ID Group No. Score
1 1 0.5
2 3 1.0
3 1 0.8
4 2 0.7
5 2 1.5
6 3 1.2
7 1 0.8
8 2 0.9
9 4 1.3
… … …
Key Ideas about Map-Reduce
• Data are split into partitions and stored on many different machines on disk (distributed storage)
• Mappers process data chunks independently and emit <key, value> pairs
– For each row:
• Key = Group No.
• Value = Score
• Data with the same key are sent to the same reducer. One reducer can receive multiple keys
– E.g., 2 reducers
– Reducer 1 receives data with key = 1, 2
– Reducer 2 receives data with key = 3, 4
• Every reducer sorts its data by key
– E.g., Reducer 1: <1, [0.5, 0.8, 0.8]>, <2, [0.7, 1.5, 0.9]>
• For each key, the reducer processes the corresponding values with the customized reducer function and outputs the result
– E.g., Reducer 1 output: <1, 0.7>, <2, 1.033>
What You Need to Implement: Pseudo Code (in R)

Mapper:
Input: Data
for (row in Data)
{
  groupNo = row$groupNo;
  score = row$score;
  Output(c(groupNo, score));
}

Reducer:
Input: Key (groupNo), Value (a list of scores belonging to the Key)
count = 0;
sum = 0;
for (v in Value)
{
  sum = sum + v;
  count = count + 1;
}
Output(c(Key, sum/count));
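The pseudo code above can be run end to end as a small Python simulation on the slide's sample table (Python stands in for the R-style pseudo code; the shuffle is mimicked with a dictionary):

```python
from collections import defaultdict

# Rows of (ID, groupNo, score) from the slide's table.
data = [(1, 1, 0.5), (2, 3, 1.0), (3, 1, 0.8), (4, 2, 0.7), (5, 2, 1.5),
        (6, 3, 1.2), (7, 1, 0.8), (8, 2, 0.9), (9, 4, 1.3)]

def mapper(row):
    _id, group_no, score = row
    return group_no, score            # key = group number, value = score

def reducer(key, values):
    return key, sum(values) / len(values)  # mean score for this group

grouped = defaultdict(list)           # the shuffle: same key -> same reducer
for key, value in map(mapper, data):
    grouped[key].append(value)

means = dict(reducer(k, v) for k, v in sorted(grouped.items()))
print(means)  # {1: 0.7, 2: 1.033..., 3: 1.1, 4: 1.3} (up to float rounding)
```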
Exercise 1
• Problem: Average height per {Grade, Gender}?
• What should be the mapper output key?
• What should be the mapper output value?
• What is the reducer input?
• What is the reducer output?
• Write mapper and reducer for this?
Student ID Grade Gender Height (cm)
1 3 M 120
2 2 F 115
3 2 M 116
… … …
• Problem: Average height per Grade and Gender?
• What should be the mapper output key?
– {Grade, Gender}
• What should be the mapper output value?
– Height
• What is the reducer input?
– Key: {Grade, Gender}; Value: list of Heights
• What is the reducer output?
– {Grade, Gender, mean(Heights)}
Student ID Grade Gender Height (cm)
1 3 M 120
2 2 F 115
3 2 M 116
… … …
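The solution above can be checked with a quick Python sketch using the three sample rows (illustrative only; a composite tuple plays the role of the {Grade, Gender} key):

```python
from collections import defaultdict

# (studentID, grade, gender, height) rows from the slide's table.
students = [(1, 3, 'M', 120), (2, 2, 'F', 115), (3, 2, 'M', 116)]

def mapper(row):
    student_id, grade, gender, height = row
    return (grade, gender), height    # composite key {Grade, Gender}

grouped = defaultdict(list)           # shuffle: one list of heights per key
for key, height in map(mapper, students):
    grouped[key].append(height)

avg_height = {key: sum(h) / len(h) for key, h in grouped.items()}
print(avg_height)  # {(3, 'M'): 120.0, (2, 'F'): 115.0, (2, 'M'): 116.0}
```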
Exercise 2
• Problem: Number of students per {Grade, Gender}?
• What should be the mapper output key?
• What should be the mapper output value?
• What is the reducer input?
• What is the reducer output?
• Write mapper and reducer for this?
Student ID Grade Gender Height (cm)
1 3 M 120
2 2 F 115
3 2 M 116
… … …
• Problem: Number of students per {Grade, Gender}?
• What should be the mapper output key?
– {Grade, Gender}
• What should be the mapper output value?
– 1
• What is the reducer input?
– Key: {Grade, Gender}; Value: list of 1's
• What is the reducer output?
– {Grade, Gender, sum(value list)}
– OR: {Grade, Gender, length(value list)}
Student ID Grade Gender Height (cm)
1 3 M 120
2 2 F 115
3 2 M 116
… … …
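The counting pattern above (emit a 1 per row, then sum) is exactly what `collections.Counter` does; a minimal Python check on the sample rows:

```python
from collections import Counter

# (studentID, grade, gender, height) rows from the slide's table.
students = [(1, 3, 'M', 120), (2, 2, 'F', 115), (3, 2, 'M', 116)]

# Mapper emits <(grade, gender), 1>; summing the 1's counts the students.
num_students = Counter((grade, gender) for _id, grade, gender, _h in students)
print(dict(num_students))  # {(3, 'M'): 1, (2, 'F'): 1, (2, 'M'): 1}
```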
More on Map-Reduce
• Depends on distributed file systems
• Typically mappers are the data storage nodes
• Map/Reduce tasks automatically get restarted when they fail (good fault tolerance)
• Map and Reduce I/O are all on disk
– Data transmission from mappers to reducers is through disk copy
• Iterative processing through Map-Reduce
– Each iteration becomes a map-reduce job
– Can be expensive since map-reduce overhead is high
Introduction To The
Hadoop System
The Apache Hadoop System
• Open-source software for reliable, scalable, distributed computing
• The most popular distributed computing system in the world
• Key modules:
– Hadoop Distributed File System (HDFS)
– Hadoop YARN (job scheduling and cluster resource management)
– Hadoop MapReduce
Major Tools on Hadoop
• Pig
– A high-level language for Map-Reduce computation
• Hive
– A SQL-like query language for data querying via Map-Reduce
• HBase
– A distributed & scalable database on Hadoop
– Allows random, real-time read/write access to big data
– Voldemort is similar to HBase
• Mahout
– A scalable machine learning library
• …
What is Covered In This Tutorial
• Hadoop Distributed File System (HDFS)
• Pig
• A Deep dive of MapReduce and Hadoop
• Running R on Hadoop
Hadoop Installation
• Setting up Hadoop on your desktop/laptop:
– http://hadoop.apache.org/docs/stable/single_node_setup.html
• Setting up Hadoop on a cluster of machines:
– http://hadoop.apache.org/docs/stable/cluster_setup.html
Hadoop Distributed File System (HDFS)
• Master/Slave architecture
• NameNode: a single master node that controls which data block is stored where.
• DataNodes: slave nodes that store data and do R/W operations
• Clients (Gateway): Allow users to login and interact with HDFS and submit Map-Reduce jobs
• Big data is split into equal-sized blocks; each block can be stored on a different DataNode
• Disk failure tolerance: data is replicated multiple times
A Typical User Session on Hadoop
• Log in to a Hadoop client machine
• Interact with HDFS to
– Upload the data from local / locate where the data is
• Submit a Map-Reduce job
• Debug the job if it fails via Hadoop Job Tracker
• Interact with HDFS to
– Check the output of the job
• Copy the output to local for further analysis if needed
• Log out
HDFS Command             HDFS Functionality                           Analogy to Linux Shell Command
hdfs dfs -ls             List the direct children of the path         ls
                         and their stats
hdfs dfs -cat            Output file content to stdout                cat
hdfs dfs -chmod          Change permission                            chmod
hdfs dfs -copyFromLocal  Copy from local path to HDFS                 cp
hdfs dfs -copyToLocal    Copy from HDFS to local path                 cp
hdfs dfs -mkdir          Create a directory                           mkdir
hdfs dfs -rm             Remove a file                                rm
hdfs dfs -rmr            Remove a path recursively                    rm -r
hdfs dfs -mv             Move files from source to destination        mv
hdfs dfs -cp             Copy files from source to destination        cp
…                        …                                            …
HDFS Demo Time!
Pig
• A high-level platform for Map-Reduce on Hadoop
• Pig Latin: The SQL-like intuitive language for Pig
• Can be extended via User Defined Functions (UDF) in Java, Python, etc.
Compute Mean for Each Group
ID Group No. Score
1 1 0.5
2 3 1.0
3 1 0.8
4 2 0.7
5 2 1.5
6 3 1.2
7 1 0.8
8 2 0.9
9 4 1.3
… … …
A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID : int, groupNo: int, score: float);
B = GROUP A BY groupNo;
C = FOREACH B GENERATE group, AVG(A.score) AS mean;
DUMP C;
File Sample-1.dat
Load the Data into Pig
• A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID: int, groupNo: int, score: float);
– The path of the data on HDFS comes after LOAD
• USING PigStorage() means the data are delimited by tab (can be omitted)
• If data are delimited by other characters, e.g. space, use USING PigStorage(' ')
• The data schema is defined after AS
• Variable types: int, long, float, double, chararray, …
Tuple
• A Tuple is a data record consisting of a sequence of "fields"
• Each field is a piece of data of any type
• E.g. Each row of A is a Tuple:
{ID: int, groupNo: int, score: float}
Data Bag
• A Data Bag is a set of tuples
• You may think of it as a "table"
• Pig does not require that the tuple field types match, or even that the tuples have the same number of fields
Filter Data
• D = FILTER A BY score > 0.1 AND score < …;
• FILTER keeps only the rows that satisfy the condition
Per Row Operations
• E = FOREACH A GENERATE groupNo, score, score * score AS scoreSq;
• FOREACH…GENERATE does independent per-row operations
• Column “ID” is thrown away
• A new column "scoreSq" is generated by simply computing score * score
GROUP…BY…
• B = GROUP A BY groupNo;
• A: {ID: int, groupNo: int, score: float}
• B: {group: int, A: {(ID: int, groupNo: int, score: float)}}
• Results of GROUP…BY… always have two fields
– The field "group" saves the group information. Here, it is the same as groupNo.
– The second field is a data bag named after the variable after GROUP (e.g. A here), which consists of all the tuples (rows) for this group (e.g. with the same groupNo)
AVG (Average) Operation
• B = GROUP A BY groupNo;
• C = FOREACH B GENERATE group, AVG(A.score) AS mean;
• B: {group: int, A: {(ID: int, groupNo: int, score: float)}}
• C: {group: int, mean: double}
• The AVG function computes the average of the numeric values in a single-column data bag
• A.score retrieves the "score" column from the field "A" of B and forms a new data bag
Other Sample Functions on Data Bags
• SUM: Computes the sum of the numeric values in a single-column bag.
• COUNT: Counts the number of tuples in a data bag
• MIN / MAX: Computes the min/max of the numeric values in a single-column bag.
• IsEmpty: Checks if a data bag is empty
• …
Pig Output
• DUMP function outputs on screen
• STORE outputs to HDFS storage
• For example
– DUMP C;
– STORE C INTO 'output/Output-1' USING PigStorage();
– Pig stores C into the directory with path output/Output-1
• STORE creates the directory if it doesn't exist
– The job fails if the directory already exists
– Use rmf before STORE to force overwrite
Pig Demo 1
• Problem: Number of students per {Grade, Gender}?
Student ID Grade Gender Height (cm)
1 3 M 120
2 2 F 115
3 2 M 116
… … …
File Sample-2.dat
A = LOAD 'Sample-2.dat' USING PigStorage() AS (studentID: int, grade: int, gender: chararray, height: int);
B = GROUP A BY (grade, gender);
C = FOREACH B GENERATE FLATTEN(group) AS (grade, gender), COUNT(A) AS numStudents;
DUMP C;
GROUP…BY…
• B = GROUP A BY (grade, gender);
• A: {studentID : int, grade: int, gender: chararray, height: int}
• B: {group: (grade: int, gender: chararray), A: {(studentID: int, grade: int, gender: chararray, height: int)}}
• The “group” field in B now becomes a Tuple with two fields: grade and gender
FLATTEN
• FLATTEN has to be used with FOREACH…GENERATE…
• FLATTEN operates on Tuples or Data Bags
• FLATTEN(a field that is a Tuple) => a set of fields
– E.g. A: {a: int, b: (c: int, d: chararray)}
– B = FOREACH A GENERATE a, FLATTEN(b);
– B: {a: int, b::c: int, b::d: chararray}
– B = FOREACH A GENERATE a, FLATTEN(b) AS (c: int, d: chararray);
– B: {a: int, c: int, d: chararray}
FLATTEN
• FLATTEN(a DataBag) generates N records if the bag has N tuples
• It cross products with the other fields in the data
• For example
– A: {a: int, b: {(c: int, d: chararray)}}
– The field b is a Data Bag that contains Tuples with two fields: c and d
– B = FOREACH A GENERATE a, FLATTEN(b) as (c:int, d:chararray);
A
(1, {(2,M), (5,F)})
(3, {(4,F), (6,M), (7,F)})
B
(1, 2, M)
(1, 5, F)
(3, 4, F)
(3, 6, M)
(3, 7, F)
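The cross-product behavior of FLATTEN on a bag can be reproduced with a Python list comprehension (an illustrative analogy, not Pig itself):

```python
# Each record of A pairs a scalar with a "bag" of tuples, as on the slide.
A = [(1, [(2, 'M'), (5, 'F')]),
     (3, [(4, 'F'), (6, 'M'), (7, 'F')])]

# FLATTEN(bag) cross-products the bag's tuples with the record's other fields,
# producing one output record per tuple in the bag.
B = [(a, c, d) for a, bag in A for (c, d) in bag]
print(B)  # [(1, 2, 'M'), (1, 5, 'F'), (3, 4, 'F'), (3, 6, 'M'), (3, 7, 'F')]
```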
Pig Demo 2
JOIN

A = LOAD 'Sample-3a.dat' AS (a1: int, a2: int, a3: int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

B = LOAD 'Sample-3b.dat' AS (b1: int, b2: int);
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)

X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
• For each key, matching records form a cross-product
• Records that do not match get thrown away
COGROUP

A = LOAD 'Sample-3a.dat' AS (a1: int, a2: int, a3: int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

B = LOAD 'Sample-3b.dat' AS (b1: int, b2: int);
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)

W = COGROUP A BY a1, B BY b1;
DUMP W;
(1,{(1,2,3)},{(1,3)})
(2,{},{(2,4),(2,7),(2,9)})
(4,{(4,2,1),(4,3,3)},{(4,6),(4,9)})
(7,{(7,2,5)},{})
(8,{(8,3,4),(8,4,3)},{(8,9)})

DESCRIBE W;
W: {group: int, A: {(a1: int, a2: int, a3: int)}, B: {(b1: int, b2: int)}}
COGROUP and JOIN
• X = JOIN A BY a1, B BY b1;
• X: {A::a1: int,A::a2: int,A::a3: int,B::b1: int,B::b2: int}
• W = COGROUP A BY a1, B BY b1;
• X = FOREACH W GENERATE FLATTEN(A), FLATTEN(B);
• X: {A::a1: int,A::a2: int,A::a3: int,B::b1: int,B::b2: int}
• The two are equivalent, but COGROUP is more flexible
Pig Demo 3
• Students at schools take tests (e.g. SAT)
• One student can take the same test multiple times
• We take Max(scores) from one student as his/her final score
• Problem: The top 3 schools with the most students having final score > 90?
Student ID School Score
1 B 89
1 B 90
3 B 75
2 A 60
4 C 90
… … …
File Sample-4.dat
• Problem: The top 3 schools with the most students having final score > 90?
A = LOAD 'Sample-4.dat' AS (studentID: int, school: chararray, score: int);
B = GROUP A BY school;
C = FOREACH B {
D = FILTER A BY score>90;
E = DISTINCT D.studentID;
GENERATE FLATTEN(group) AS school, COUNT(E) AS count;
}
F = ORDER C BY count DESC;
G = LIMIT F 3;
DUMP G;
• Nested Block for FOREACH…GENERATE
• The last statement in the nested block must be GENERATE.
• B: {group: chararray, A: {(studentID: int, school: chararray, score: int)}}
• For each record in B, A is a field that is a Data Bag storing the data for the group (school)
• D filters the Data Bag A by score > 90 and generates a new Data Bag
• E takes the studentID field from D, forms a new Data Bag with single-field tuples (D.studentID), and removes duplicates via DISTINCT
• ORDER…BY is the sort function, implemented in map-reduce
• DESC means sort in descending order; without DESC, ascending order
• Q: What does this mean?
H = ORDER A BY score DESC, studentID;
• LIMIT operation here picks the top 3 records from all the records
Pig Demo 4
User Defined Functions in Pig (UDF)
• Supports customized computation in mapper/reducer stages
• Best to write in Java
• Also supports Python
Calling Java UDF in Pig
• Example: Change the name field to upper case
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
• REGISTER registers the jar file that contains the UDF Java class
• myudfs.UPPER(name) calls the UDF to change the name to upper case
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
• The UDF is used in FOREACH…GENERATE context
• EvalFunc<String> declares the output type (String)
• The input type is always a Tuple of the function's parameters; for this example there is only 1 parameter, the string to be upper-cased
• exec returns the upper-cased string
Summary of Pig
• A really nice interactive tool for Hadoop MapReduce beginners
• Efficient for quick statistical data analysis on big data (very little coding time)
• UDF support enables comprehensive MapReduce jobs
A Deep Dive into Hadoop
Map-Reduce
Hadoop Map-Reduce in Java
• A natural way for implementing Map-Reduce since Hadoop is written in Java
• A Map-Reduce job class contains at least
– A Mapper class
– A Reducer class
– A main function that sets up the configs and runs the job
• More info: http://hadoop.apache.org/docs/stable/mapred_tutorial.html
Word Count Through Map-Reduce
[Diagram: Mapper 1 reads "Hello World / Bye World", Mapper 2 reads "Hello Hadoop / Goodbye Hadoop"; Reducer 1 handles words from A-G, Reducer 2 handles words from H-Z]
• Large amount of data to transfer from mappers to reducers for large data sets
How About…
[Diagram: the same word-count job, with the word counts aggregated on each mapper before being sent to Reducer 1 (words A-G) and Reducer 2 (words H-Z)]
Combiners
• A reducer-like class that runs in the mapper stage to aggregate data into sufficient statistics before sending them to reducers
• Can significantly improve Hadoop Map-Reduce performance
• Pig: Automatically applies combiners
[Diagram: Mapper 1 → Combiner 1 and Mapper 2 → Combiner 2 aggregate locally before sending to Reducer 1 (words A-G) and Reducer 2 (words H-Z)]
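A small Python sketch of why a combiner helps (illustrative only): aggregating the raw <word, 1> pairs on the mapper node shrinks what has to cross the network.

```python
from collections import Counter

def mapper(document):
    # Emit a raw <word, 1> pair per word.
    return [(w, 1) for w in document.split()]

def combine(pairs):
    # Combiner: aggregate locally into one <word, partial_count> per distinct word.
    return sorted(Counter(w for w, _ in pairs).items())

doc = "Hello World Bye World"
raw = mapper(doc)
combined = combine(raw)
print(len(raw), "pairs without a combiner")   # 4
print(combined)  # [('Bye', 1), ('Hello', 1), ('World', 2)] -> only 3 pairs shipped
```

The reducer then sums partial counts instead of 1's; the final result is unchanged because the sum is associative.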
Partitioners
• Determines which <key, value> pair is sent to which reducer
• Default: partitioning based on the key's hash value
• Can be overridden according to customized needs
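Both behaviors fit in a few lines of Python (a sketch, not the Hadoop `Partitioner` API; the custom rule below mirrors the A-G/H-Z split from the word-count example):

```python
def default_partitioner(key, num_reducers):
    # Default behaviour: route by the key's hash value, modulo #reducers.
    return hash(key) % num_reducers

def range_partitioner(key, num_reducers=2):
    # Custom partitioner: words starting A-G go to reducer 0, H-Z to reducer 1.
    return 0 if key[0].upper() <= 'G' else 1

print(range_partitioner("Bye"), range_partitioner("Hadoop"))  # 0 1
```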
Number of Reducers
• Too small
– The job runs forever
• Too big
– Waiting to obtain many reducer nodes and starting those nodes can take long
– A set of tiny output files causes
• Inefficient Hadoop namespace usage
• Jobs that read this output as input to use an unnecessarily large number of mappers
• Ideally
– Each reducer runs for at least several minutes, but not too long
– Output partition size is at least a few hundred MB
• In Pig
– B = GROUP A BY name PARALLEL 10;
– C = COGROUP A BY name, B BY name PARALLEL 10;
– B = ORDER A BY name PARALLEL 10;
– …
More Efficient Hadoop Jobs
• Why combiners?
– Use combiners if necessary to reduce the amount of data transferred from mappers to reducers
• Why partitioners?
– Use partitioners to make sure the computation time for each reducer is similar
• Why optimize the number of reducers?
– Optimize Map-Reduce running time
– Optimize namespace usage
More Efficient Pig Scripts
• Remove unnecessary fields early and often using FOREACH…GENERATE
• Filter early and often
• COGROUP and JOIN: smaller files always on the left
– A = join small by x, large by y;
• Use PARALLEL to tune the parallelism of the job (Q: Why?)
Hadoop Streaming
• A utility to create and run Map-Reduce with any executable or script as mapper/reducer
• For example
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-reducer /bin/wc \
-file myPythonScript.py
• -file ships the Python script to the cluster during job submission
Running R using Hadoop Streaming
• Install R on Hadoop gateway machines
• Package the R dir into R.jar by "jar cvf R.jar -C R/ ."
• Copy R.jar from local to Hadoop
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper "myR/bin/R --vanilla --slave --no-restore -f mapper.R" \
-reducer "myR/bin/R --vanilla --slave --no-restore -f reducer.R" \
-file mapper.R \
-file reducer.R \
-cacheArchive 'hdfs://[nameNodeURL]:9000/[HDFS location]/R.jar#myR' \
-cmdenv R_HOME_DIR=myR
• -cacheArchive ships R.jar to each mapper/reducer node, unzips the jar into directory myR, and executes R from myR/bin/R
RHive
• An R package that interacts with Hive (SQL-like in Hadoop) from R
• Connects R objects and functions with Hive
• http://cran.r-project.org/web/packages/RHive/RHive.pdf
RHIPE
• Another R package that aims to interact with Hadoop HDFS and Map-Reduce
• Sample R functions
– rhput: Put a file to HDFS
– rhget: Copy a file from HDFS to local
– rhmr: Prepare a Map-Reduce job for execution
– rhex: Execute a Map-Reduce job
– rhkill: Kill a Map-Reduce job
– …
• http://www.datadr.org/getpack.html
Summary of Part I
• Big Data Overview: Statistical perspective
• Map-Reduce: An increasingly popular computing system to analyze big data
• Hadoop System: An open source implementation of Map-Reduce
• Pig: High level Hadoop Map-Reduce language (like R for big data)
Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop System
– Overview of Distributed Computing
– Introduction to Map-Reduce
– Introduction to the Hadoop System
– The Pig Language
– A Deep Dive of Hadoop Map-Reduce
• Part II: Examples of Statistical Computing for Big Data
– Bag of Little Bootstraps
– Large Scale Logistic Regression
– Parallel Matrix Factorization
• Part III: The Future of Cloud Computing
Bag of Little Bootstraps
Kleiner et al. 2012
Bootstrap (Efron, 1979)
• A re-sampling based method to obtain statistical distribution of sample estimators
• Why are we interested?
– Re-sampling is embarrassingly parallelizable
• For example: standard deviation of the mean of N samples (μ)
– For i = 1 to r do
• Randomly sample with replacement N times from the original sample -> bootstrap data i
• Compute the mean of the i-th bootstrap data -> μi
– Estimate of Sd(μ) = Sd([μ1,…,μr])
– r is usually a large number, e.g. 200
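The loop above is small enough to sketch directly (plain Python; the toy data and r = 200 are illustrative):

```python
import random
import statistics

def bootstrap_sd_of_mean(sample, r=200, seed=0):
    """Estimate Sd(mean) via the bootstrap: r resamples of size N."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(r):
        # one bootstrap data set: N draws with replacement
        boot = [rng.choice(sample) for _ in range(n)]
        means.append(sum(boot) / n)
    return statistics.stdev(means)  # Sd([mu_1, ..., mu_r])

sample = list(range(1, 101))        # toy data: 1..100
est = bootstrap_sd_of_mean(sample)  # theoretical Sd(mean) here is about 2.9
```

Each of the r iterations is independent, which is exactly why the computation parallelizes so easily.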
Bootstrap for Big Data
• Can have r nodes running in parallel, each generating one bootstrap data set
• However…
– N can be very large
– Data may not fit into memory
– Collecting N samples with replacement on each
node can be computationally expensive
M out of N Bootstrap
(Bickel et al. 1997)
• Obtain SdM(μ) by sampling M samples with replacement for each bootstrap, where M < N; an analytical correction then rescales the estimate back to sample size N
Bag of Little Bootstraps (BLB)
• Example: Standard deviation of the mean
• Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)
• For each data set p = 1 to S do
– For i = 1 to r do
• Sample N times with replacement from the data of size b
• Compute the mean of the resampled data -> μpi
– Compute Sdp(μ) = Sd([μp1,…,μpr])
• Estimate of Sd(μ) = Avg([Sd1(μ),…, SdS(μ)])
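A single-machine sketch of this procedure, with illustrative values of S, r and γ (in practice each subset's inner loop would run on its own node):

```python
import random
import statistics

def blb_sd_of_mean(data, s=5, r=100, gamma=0.7, seed=0):
    """Bag of Little Bootstraps for Sd(mean): S subsets of size b = N^gamma."""
    rng = random.Random(seed)
    n = len(data)
    b = int(n ** gamma)
    subset_sds = []
    for _ in range(s):
        subset = rng.sample(data, b)        # size-b subset, without replacement
        means = []
        for _ in range(r):
            counts = [0] * b                # N draws spread over the b items
            for _ in range(n):
                counts[rng.randrange(b)] += 1
            means.append(sum(c * x for c, x in zip(counts, subset)) / n)
        subset_sds.append(statistics.stdev(means))
    return sum(subset_sds) / s              # average the S estimates

data = list(range(1, 1001))
est = blb_sd_of_mean(data)
```

Note that each resample is represented only by its b counts, never as a materialized size-N data set.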
Bag of Little Bootstraps (BLB)
• Interest: ξ(θ), where θ is an estimate obtained from size-N data
– ξ is some function of θ, such as the standard deviation, …
• Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)
• For each data set p = 1 to S do
– For i = 1 to r do
• Sample N times with replacement from the data of size b
• Compute the estimate on the resampled data -> θpi
– Compute ξp(θ) = ξ([θp1,…,θpr])
• Estimate of ξ(θ) = Avg([ξ1(θ),…, ξS(θ)])
Bag of Little Bootstraps (BLB)
[Diagram: the same algorithm mapped onto Hadoop — the S subsets and their r resamples are processed by mappers and reducers; the gateway averages the S estimates ξp(θ).]
Why is BLB Efficient?
• Before:
– N samples with replacement from size-N data is expensive when N is large
• Now:
– N samples with replacement from size-b data
– b can be several orders of magnitude smaller than N (e.g. b = Nγ, γ in [0.5, 1))
– Equivalent to: a multinomial sampler with dimension b
– Storage = O(b), computational complexity = O(b)
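The multinomial equivalence is easy to see in code: only the b counts are ever stored, however large N is (the subset values and sizes below are illustrative):

```python
import random

def resample_counts(n, b, rng):
    """Resampling n times with replacement from b items is equivalent to one
    Multinomial(n; 1/b, ..., 1/b) draw: only the b counts are stored."""
    counts = [0] * b
    for _ in range(n):
        counts[rng.randrange(b)] += 1
    return counts

rng = random.Random(42)
subset = [float(i) for i in range(50)]        # a size-b subset, b = 50
counts = resample_counts(1_000_000, 50, rng)  # a "size 1M" resample in O(b) memory
# any estimator can consume the counts as weights, never materializing N rows:
weighted_mean = sum(c * x for c, x in zip(counts, subset)) / 1_000_000
```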
Simulation Experiment
• 95% CI of Logistic Regression Coefficients
• N = 20000, 10 explanatory variables
• Relative Error = |Estimated CI width – True CI width | / True CI width
• BLB-γ: BLB with b = Nγ
• BOFN-γ: b out of N sampling with b = Nγ
• BOOT: Naïve bootstrap
Simulation Experiment
[Figure: relative error vs. computation time for BLB-γ, BOFN-γ, and BOOT]
Real Data
• 95% CI of Logistic Regression Coefficients
• N = 6M, 3000 explanatory variables
• Data size = 150GB, r = 50, S = 5, γ = 0.7
[Figures: relative error vs. time with data stored on disk, and with data stored in memory using Spark]
Hyper-parameter Selection
• From empirical experiments in the paper
– S >= 3 and r >= 50 are sufficient for low relative errors
• Adaptively selecting r and S
– Increase r and S until the estimated value converges
– For r: continue to process resamples and update ξp(θ) until it has ceased to change significantly
– For S: continue to process more subsamples (i.e., increasing S) until BLB's output value has stabilized
Hyper-parameter Selection
• Adaptively selecting r and S
– Increase r and S until the estimated value converges
[Figure: convergence of BLB-0.7 vs. adaptive BLB]
Summary of BLB
• A new algorithm for bootstrapping on big data
• Advantages
– Fast and efficient
– Easy to parallelize
– Easy to understand and implement
– Friendly to Hadoop; makes it routine to perform statistical calculations on big data
Large Scale Logistic Regression
Deepak Agarwal, Bee-Chung Chen, Bo Long,
Liang Zhang, Xianxing Zhang
Applied Relevance Science at LinkedIn
Logistic Regression
• Binary response: Y
• Covariates: X
• Yi ~ Bernoulli(pi)
• log(pi/(1-pi)) = XiTβ ; β ~ MVN(0 , 1/λ I )
• Widely used (research and applications)
Response Prediction: Application of Logistic Regression in Recommender Systems
[Diagram] User i (with user features, e.g., industry, behavioral features, demographic features, …) visits; the algorithm selects item j from a set of candidates; we observe the response yij (click or not).
• Which item should we recommend to the user?
→ The item with the highest predicted response rate
• Logistic regression is an effective technique for this prediction
Examples of Recommender Systems
Similar problem: content recommendation on the Yahoo! front page (Today module)
• Recommend content links (out of 30–40, editorially programmed)
• 4 slots exposed; F1 has maximum exposure
• Routes traffic to other Y! properties
[Screenshot: Today module with slots F1–F4]
Right Media Ad Exchange: Unified Marketplace
• Match ads to page views on publisher sites
[Diagram] A publisher with an ad impression to sell runs an auction: one party bids $0.50; another bids $0.75 via a network, which becomes a $0.45 bid; among the other bidders (e.g., AdSense, Ad.com at $0.60), the $0.65 bid wins.
Logistic Regression for Response Prediction
• Binary response: Y
– Click / non-click
• Covariates: X
– User covariates: age, gender, industry, education, job, job title, …
– Item covariates: categories, keywords, topics, …
– Context covariates: time, page type, position, …
– 2-way interactions: user covariates × item covariates; context covariates × item covariates; …
Computational Challenge
• Hundreds of millions/billions of observations
• Hundreds of thousands/millions of covariates
• Fitting such a logistic regression model on a single machine is not feasible
• Model fitting is iterative, using methods like gradient descent, Newton's method, etc.
– Multiple passes over the data
Recap on Optimization Methods
• Problem: find x to minimize F(x)
• Iteration n: xn = xn−1 − bn−1 F′(xn−1)
• bn−1 is the step size, which can change every iteration
• Iterate until convergence
• Conjugate gradient, L-BFGS, Newton trust region, … are all of this kind
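A minimal sketch of the iteration above on a toy one-dimensional problem (the quadratic objective is illustrative):

```python
def gradient_descent(grad, x0, step=0.1, iters=100, tol=1e-8):
    """Minimize F by iterating x_n = x_{n-1} - b * F'(x_{n-1})."""
    x = x0
    for _ in range(iters):
        x_new = x - step * grad(x)
        if abs(x_new - x) < tol:   # stop once updates become negligible
            break
        x = x_new
    return x

# Toy example: F(x) = (x - 3)^2, so F'(x) = 2(x - 3); minimum at x = 3
xmin = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```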
Iterative Process with Hadoop
[Diagram] Disk → Mappers → Disk → Reducers → Disk, repeated for every iteration
Limitations of Hadoop for fitting a big
logistic regression
• Iterative process is expensive and slow
• Every iteration = a Map-Reduce job
• I/O of mappers and reducers both goes through disk
• Plus: Waiting in queue time
• Q: Can we find a fitting method that scales with Hadoop ?
Large Scale Logistic Regression
• Naïve approach:
– Partition the data and run logistic regression on each partition
– Take the mean of the learned coefficients
– Problem: not guaranteed to converge to the model fit on a single machine!
• Alternating Direction Method of Multipliers (ADMM)
– Boyd et al. 2011
– Set up constraints: each partition's coefficients = global consensus
– Solve the optimization problem using Lagrange multipliers
– Advantage: guaranteed to converge to the single-machine logistic regression on the entire data in a reasonable number of iterations
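A single-machine sketch of consensus ADMM for L2-penalized logistic regression on synthetic data. A plain gradient inner solver stands in for the per-partition fits, and rho, lam, and the partitioning below are illustrative choices, not the paper's settings:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_fit(X, y, z, u, rho, steps=200, lr=0.1):
    """Approximately minimize logistic loss + (rho/2)||beta - z + u||^2
    by plain gradient descent (a stand-in for liblinear/glmnet)."""
    beta = z.copy()
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ beta) - y) / len(y) + rho * (beta - z + u)
        beta -= lr * grad
    return beta

def admm_logistic(parts, d, rho=1.0, iters=30, lam=0.1):
    """Consensus ADMM: K parallel local fits + averaging + dual update."""
    K = len(parts)
    z = np.zeros(d)
    us = [np.zeros(d) for _ in range(K)]
    for _ in range(iters):
        betas = [local_fit(X, y, z, u, rho) for (X, y), u in zip(parts, us)]
        # z-update: average the local solutions (with L2 shrinkage from r(z))
        z = rho * sum(b + u for b, u in zip(betas, us)) / (lam + K * rho)
        us = [u + b - z for u, b in zip(us, betas)]   # dual (Lagrange) update
    return z

rng = np.random.default_rng(0)
beta_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(600, 3))
y = (rng.random(600) < sigmoid(X @ beta_true)).astype(float)
parts = [(X[i::3], y[i::3]) for i in range(3)]   # K = 3 partitions
z = admm_logistic(parts, d=3)                    # consensus estimate
```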
Large Scale Logistic Regression via ADMM
[Diagram] BIG DATA is split into Partition 1 … Partition K; in each iteration, a logistic regression is fit on every partition in parallel, followed by a consensus computation; the cycle then repeats (Iteration 1, Iteration 2, …).
Details of ADMM
Dual Ascent Method
• Consider a convex optimization problem: min f(x) subject to Ax = b
• Lagrangian for the problem: L(x, y) = f(x) + yᵀ(Ax − b)
• Dual ascent:
– xk+1 = argminx L(x, yk)
– yk+1 = yk + αk (Axk+1 − b)
Augmented Lagrangians
• Bring robustness to the dual ascent method
• Yield convergence without assumptions like strict convexity or finiteness of f
• Lρ(x, y) = f(x) + yᵀ(Ax − b) + (ρ/2)‖Ax − b‖²
• The value of ρ influences the convergence rate
Alternating Direction Method of
Multipliers (ADMM)
• Problem: min f(x) + g(z) subject to Ax + Bz = c
• Augmented Lagrangian: Lρ(x, z, y) = f(x) + g(z) + yᵀ(Ax + Bz − c) + (ρ/2)‖Ax + Bz − c‖²
• ADMM:
– xk+1 = argminx Lρ(x, zk, yk)
– zk+1 = argminz Lρ(xk+1, z, yk)
– yk+1 = yk + ρ(Axk+1 + Bzk+1 − c)
Large Scale Logistic Regression via ADMM
• Notation
– (Xi , yi): data in the ith partition
– βi: coefficient vector for partition i
– β: Consensus coefficient vector
– r(β): penalty component such as ||β||22
• Optimization problem: minimize Σi ℓ(βi; Xi, yi) + r(β) subject to βi = β for i = 1, …, K, where ℓ is the logistic log-loss
ADMM Updates
[Equations] Local regressions: each partition fits a logistic regression with shrinkage towards the current best global estimate; the consensus step then combines the local solutions into an updated consensus.
An example implementation
• ADMM for logistic regression model fitting with an L2/L1 penalty
• Each iteration of ADMM is a Map-Reduce job
– Mapper: partition the data into K partitions
– Reducer: for each partition, use liblinear/glmnet to fit an L1/L2 logistic regression
– Gateway: compute the consensus from all the reducers' results, and send it back to each reducer node
KDD CUP 2010 Data
• Bridge to Algebra 2008-2009 data in https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
• Binary response, 20M covariates
• Only keep covariates with >= 10 occurrences => 2.2M covariates
• Training data: 8,407,752 samples
• Test data : 510,302 samples
Avg Training Log-likelihood vs. Number of Iterations
Test AUC vs Number of Iterations
Better Convergence Can Be Achieved By
• Better initialization
– Use results from the naïve method to initialize the parameters
• Adaptively changing the step size (ρ) each iteration based on the convergence status of the consensus
Still…
• ADMM in Hadoop can take hours to converge
• Is there a better way to handle iterative learning process in Hadoop?
Parallel Matrix Factorization
Deepak Agarwal, Bee-Chung Chen,
Rajiv Khanna, Liang Zhang
Applied Relevance Science at LinkedIn
Personalized Webpage Is Everywhere
Personalized Webpage Is Everywhere
Common Properties of Web Personalization Problems
• One or multiple metrics to optimize
– Click Through Rate (CTR) (focus of this talk)
– Revenue per impression
– Time spent on the landing page
– Ad conversion rate
– …
• Large-scale data
– MapReduce to solve the problem!
• Sparsity
• Cold start
– User features: age, gender, position, industry, …
– Item features: category, keywords, creator features, …
Problem Setup
• CTR prediction for a user on an item
• Assumptions:
– There are sufficient data per item to estimate a per-item model
– Serving bias and positional bias are removed by a random serving scheme
– Item popularities are quite dynamic and have to be estimated in real time
• Examples:
– Yahoo! front page Today module
– LinkedIn Today module
Online Logistic Regression (OLR)
• User i with features xi, article j
• Binary response y (click/non-click)
• Model: log(p/(1 − p)) = xiᵀβj
• Prior: βj ~ N(mj, Σj)
• Use Laplace approximation or variational Bayesian methods to obtain the posterior of βj
• The posterior becomes the new prior for the next batch of data
• Can approximate mj and Σj as diagonal for high-dimensional xi
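A much-simplified sketch of the idea: a diagonal Gaussian prior, one Newton-style update per event, and the posterior reused as the next prior. The closed-form per-dimension update below is an assumption standing in for the full Laplace/variational machinery on the slide:

```python
import math

def olr_update(m, v, x, y):
    """One online update for Bayesian logistic regression with a diagonal
    Gaussian prior N(m, diag(v)): a single Newton-style step per dimension."""
    z = sum(mi * xi for mi, xi in zip(m, x))
    p = 1.0 / (1.0 + math.exp(-z))   # predicted CTR under the prior mean
    w = p * (1.0 - p)                # curvature of the logistic log-likelihood
    # posterior precision: 1/v' = 1/v + w*x^2 ; posterior mean: m' = m + v'*(y - p)*x
    v_new = [vi / (1.0 + vi * w * xi * xi) for vi, xi in zip(v, x)]
    m_new = [mi + vn * (y - p) * xi for mi, vn, xi in zip(m, v_new, x)]
    return m_new, v_new

# the posterior becomes the prior for the next event
m, v = [0.0, 0.0], [1.0, 1.0]
for _ in range(20):                  # 20 clicks on an item with x = [1, 0]
    m, v = olr_update(m, v, [1.0, 0.0], 1.0)
```

With repeated clicks the mean of the clicked feature rises and its variance shrinks, i.e. the model becomes more certain.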
User Covariates for OLR
• Age, gender, industry, job position for login users
• General behavior targeting (BT) covariates
– Music? Finance? Politics?
• User profiles from historical view/click behavior on previous items in the data, e.g.
– Item-profile: use previously clicked item ids as the user profile
– Category-profile: use item-category affinity scores as the profile; the score can simply be the user's historical CTR on each category
– Are there better ways to generate user profiles?
– Yes! By matrix factorization!
Generalized Matrix Factorization
(GMF) Framework
• sij = f(xij) + αi + βj + uiᵀvj
– f(xij): global features; αi: user effect; βj: item effect; ui: user factors; vj: item factors
• Bell et al. (2007)
Regression Priors
• αi ~ N(g(xi), σα²), βj ~ N(h(xj), σβ²), ui ~ N(G(xi), σu² I), vj ~ N(H(xj), σv² I), where xi are user covariates and xj are item covariates
• g(·), h(·), G(·), H(·) can be any regression functions
• Agarwal and Chen (KDD 2009); Zhang et al. (RecSys 2011)
Different Types of Prior Regression Models
• Zero prior mean
– Bilinear random effects (BIRE)
• Linear regression
– Simple regression (RLFM)
– Lasso penalty (LASSO)
• Tree models
– Recursive partitioning (RP)
– Random forests (RF)
– Gradient boosting machines (GB)
– Bayesian additive regression trees (BART)
Several Model Fitting Approaches
• Gibbs Sampling
• Stochastic Gradient Descent
• Monte Carlo Expectation-Maximization (MCEM)
• For now: Single machine only (will discuss Parallel
Matrix Factorization later)
Gibbs Sampling
• Put additional priors on f(·), g(·), h(·), G(·), H(·), σα, σβ, σu, σv
• For each iteration, sample full conditional posteriors of α, β, u, v, f, g, h, G, H, σα, σβ, σu, σv
• Need plenty of iterations to converge and obtain reasonable posterior mean estimates of these parameters
• When the data are large, this is not feasible on a single machine
• The iterative nature of Gibbs sampling makes parallelization on Hadoop infeasible
Stochastic Gradient Descent (SGD)
• A popular model fitting approach for matrix
factorization since Netflix competition
• Model: yij ≈ uiᵀvj, where ui = U xi and vj = V xj
• U and V are unknown coefficient matrices for cold start, mapping xi and xj to a low-dimensional latent space
• Loss function L = Σk Lk, a sum of per-observation losses plus regularization
Stochastic Gradient Descent (SGD)
• Assume the data have N samples
• Loss function L = Σk Lk
• For k = 1, 2, …, N do
– Take a gradient step on observation k: θ ← θ − ρk ∂Lk/∂θ
• Can run multiple passes over the data to achieve convergence
• ρk is the step size for each observation k
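A single-machine sketch of the per-observation SGD update for yij ≈ ui·vj, using squared-error loss on a toy 2×2 rating matrix (the loss, step size, and regularization are illustrative choices):

```python
import random

def sgd_mf(ratings, n_users, n_items, k=2, lam=0.01, rho=0.02, passes=500, seed=0):
    """Plain SGD for y_ij ~ u_i . v_j: for each observation, take a gradient
    step on its own loss term L_k (squared error with L2 regularization)."""
    rng = random.Random(seed)
    u = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    v = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(passes):
        rng.shuffle(ratings)
        for i, j, y in ratings:
            e = y - sum(a * b for a, b in zip(u[i], v[j]))   # residual
            for f in range(k):
                ui, vj = u[i][f], v[j][f]
                u[i][f] += rho * (e * vj - lam * ui)         # step on dL_k/du
                v[j][f] += rho * (e * ui - lam * vj)         # step on dL_k/dv
    return u, v

# toy 2x2 rating matrix [[5, 1], [1, 5]]
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 5.0)]
u, v = sgd_mf(ratings, n_users=2, n_items=2)
pred_00 = sum(a * b for a, b in zip(u[0], v[0]))   # should approach 5
pred_01 = sum(a * b for a, b in zip(u[0], v[1]))   # should approach 1
```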
Our Approach: MCEM
• Monte Carlo EM (Booth and Hobert 1999)
• Let Δ = the latent factors (α, β, u, v); let Θ = the prior parameters (regression functions and variances)
• E-step: obtain N Monte Carlo samples from the conditional posterior of Δ given Θ
• M-step: maximize the expected complete-data log-likelihood over Θ using the samples
Handling Binary Responses
• Gaussian responses: the E-step and M-step have closed forms
• Binary responses + logistic link: no longer closed form
• Variational approximation (VAR)
• Adaptive rejection sampling (ARS)
Variational Approximation (VAR)
• Initially let ξij = 1
• Before each E-step, create pseudo Gaussian response for each binary observation
• Run E-Step and M-Step using the Gaussian pseudo response
• After the M-step, update each variational parameter ξij from the current posterior moments
Adaptive Rejection Sampling (ARS)
• For each E-step, obtain exact conditional posterior samples of the latent factors
• Adaptively update the upper and lower bounds of the log-likelihood to do rejection sampling
Simulation Study
• 10 simulated data sets, 100K samples for both training and test
• 1000 users and 1000 items in training
• Extra 500 new users and 500 new items in test + old users/items
• For each user/item, 200 covariates, only 10 useful
• Construct non-linear regression model from 20 Gaussian functions for simulating α, β, u and v following Friedman (2001)
MovieLens 1M Data Set
• 1M ratings on a 1–5 scale
• 6040 users
• 3706 movies
• Sort by time, first 75% training, last 25% test
• A lot of new users in the test data set
• User covariates: Age, gender, occupation, zip code
• Item covariates: Movie genre
Performance Comparison
However…
• We are working with very large scale data sets!
• Parallel matrix factorization methods using Map-Reduce have to be developed!
• Khanna et al. 2012 Technical report
Model Fitting Using MCEM
• Monte Carlo EM (Booth and Hobert 1999), as above: the E-step samples the conditional posterior of the latent factors; the M-step maximizes over the prior parameters
Parallel Matrix Factorization
• Partition the data into m partitions
• For each partition, run the MCEM algorithm and get a per-partition estimate (all partitions in one Map-Reduce job)
• Average the per-partition estimates to obtain a global estimate
• Ensemble runs: for k = 1, …, n (each run a separate Map-Reduce job)
– Repartition the data into m partitions with a new seed
– Run an E-step-only job for each partition given the global estimate
• Average the user/item factors over all partitions and all k to obtain the final estimate
Key Points
• Partitioning is tricky!
– By events? By items? By users?
• Empirically, "divide and conquer" plus averaging the per-partition estimates works well!
• Ensemble runs: after obtaining the averaged estimate, we run n E-step-only jobs and take the average, each job using a different user-item mix
Identifiability Issues (MCEM-ARSID)
• The same log-likelihood can be achieved by:
– Shifting: g(xi) = g(xi) + r, h(xj) = h(xj) − r
• Fix: center α, β, u to zero mean every E-step
– Sign flips: u = −u, v = −v
• Fix: constrain v to be positive
– Switching factor columns u·1, v·1 with u·2, v·2
• ui ~ N(G(xi), I), vj ~ N(H(xj), λI), with λ diagonal
• Fix: constrain the diagonal entries λ1 >= λ2 >= …
MovieLens 1M Data
• 75% training and 25% test split by time
• Imbalanced data
– User rating = 1: positive
– User rating = 2, 3, 4, 5: negative
– 5% positive rate
• Balanced data
– User rating = 1, 2, 3: positive
– User rating = 4, 5: negative
– 44% positive rate
Big difference between VAR and ARS for imbalanced data!
Matrix Factorization for User Profiles
• Offline user-profile-building period: obtain the user factor ui for each user i
• Online modeling using OLR
– If a user has a profile (warm start), use the factor ui as the user covariates
– If not (cold start), use the regression prior G(xi) as the user covariates
Offline Evaluation Metric Related to Clicks
• For model M and J live items (articles) at any time, on randomly served data: S(M) = J × (number of clicks on events where M's chosen item matches the served item)
• If M = random (constant) model, E[S(M)] = #clicks
• Unbiased estimate of the expected total clicks (Langford et al. 2008)
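A toy sketch of the replay estimator on synthetic, uniformly randomly served data (the item count, CTRs, and the constant model are all made up for illustration):

```python
import random

def replay_clicks(events, model, n_items):
    """Replay estimator (in the spirit of Langford et al. 2008) on uniformly
    randomly served data: count clicks on events where the model's pick
    matches the served item, scaled by the number of live items J."""
    matched_clicks = 0
    for user, served_item, click in events:
        if model(user) == served_item:
            matched_clicks += click
    return n_items * matched_clicks

rng = random.Random(1)
J = 4
events = []
for _ in range(40000):
    item = rng.randrange(J)                       # uniform random serving
    ctr = 0.2 if item == 0 else 0.05              # item 0 is the best item
    events.append((None, item, 1 if rng.random() < ctr else 0))

total_clicks = sum(c for _, _, c in events)       # what a random model earns
s_best = replay_clicks(events, lambda u: 0, J)    # always recommend item 0
```

Here the constant "always item 0" model scores well above the total clicks of the random serve, as it should for the best item.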
Experiments on Big Data
• Yahoo! Front Page Today Module data
• Data for building user profile: 8M users with at least 10 clicks (heavy users) in June 2011, 1B events
• Data for training and testing OLR model: Random served data with 2.4M clicks in July 2011
• Heavy users contributed around 30% of clicks
• User covariates / features for OLR:
– Intercept-only (MOST POPULAR)
– 124 Behavior targeting features (BT-ONLY)
– BT + top 1000 clicked article ids (ITEM-PROFILE)
– BT + user profile with CTR on 43 binary content categories (CATEGORY-PROFILE)
– BT + profiles from matrix factorization models
Click Lift Performance for Different User Profiles
Warm Start: Users with at least one sample in training data
Cold Start: Users with no data in training data
Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop System
– Overview of Distributed Computing
– Introduction to Map-Reduce
– Introduction to the Hadoop System
– The Pig Language
– A Deep Dive of Hadoop Map-Reduce
• Part II: Examples of Statistical Computing for Big Data
– Bag of Little Bootstraps
– Large Scale Logistic Regression
– Parallel Matrix Factorization
• Part III: The Future of Cloud Computing
Spark
• An open-source cluster computing system that works with Hadoop HDFS, developed at the UC Berkeley AMPLab
• In-memory cluster computing
• Better than Hadoop for iterative algorithms and interactive data mining
• Can be 100x faster than Hadoop Map-Reduce for some tasks
• Code in Scala – easy to write
• http://spark-project.org/
Logistic Regression in Spark vs Hadoop
Gradient Descent for Logistic
Regression
Iterative Process in Hadoop
[Diagram] Disk → Mappers → Disk → Reducers → Disk, repeated for every iteration
Iterative Process in Spark
(Gradient Descent Code)
[Diagram] Data is read from disk once and cached in memory; in each iteration, mappers compute gradients over the cached data and reducers aggregate them, with no disk I/O between iterations
GraphLab
• An open-source graph-based, high performance, distributed computation framework in C++
• http://graphlab.org/
• HDFS integration
• Major design targets:
– Sparse data with local dependencies
– Iterative algorithms
– Potentially asynchronous execution among nodes
GraphLab
• Graph-parallel
• Map-Reduce: computation applied to independent records
• GraphLab: dependent records stored as vertices in a large distributed data-graph
• Computation in parallel on each vertex and can interact with neighboring vertices
Example: PageRank for Web Pages
• Interest: the probability of landing on a page by random clicking
• R[i] = stationary probability of node i
• 1 − α = probability that people stop clicking at any page
• Update: R[i] = (1 − α)/N + α Σj→i R[j] / outdegree(j)
Example: PageRank
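A power-iteration sketch of the PageRank update on a toy three-page graph (alpha = 0.85 and the link structure are illustrative):

```python
def pagerank(links, alpha=0.85, iters=50):
    """Power iteration for R[i] = (1 - alpha)/N + alpha * sum_{j->i} R[j]/outdeg(j)."""
    n = len(links)
    r = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - alpha) / n] * n            # "stop clicking" mass
        for j, outs in links.items():
            share = alpha * r[j] / len(outs)     # page j splits its rank
            for i in outs:
                new[i] += share                  # ... among pages it links to
        r = new
    return r

# toy 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
links = {0: [1], 1: [2], 2: [0, 1]}
r = pagerank(links)
```

In a graph-parallel system like GraphLab, each vertex would run this same update locally, reading only its neighbors' values.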
Good For…
• Bilinear random effect models (matrix factorization in collaborative filtering)
• Clustering
• Graphical models
• Topic modeling
• Graph analytics
• …
GraphX
• Combines the advantages of both data-parallel and graph-parallel systems
• Distributed graph computation on Spark
Bibliography
Agarwal, D. and Chen, B. (2009). Regression-based latent factor models. In Proceedings of the
15th ACM SIGKDD international conference on Knowledge discovery and data mining, 19–28.
ACM.
Agarwal, D., Chen, B., and Elango, P. (2010). Fast online learning through offline initialization for
time-sensitive recommendation. In Proceedings of the 16th ACM SIGKDD international
conference on Knowledge discovery and data mining, 703–712. ACM.
Bell, R., Koren, Y., and Volinsky, C. (2007). Modeling relationships at multiple scales to improve
accuracy of large recommender systems. In Proceedings of the 13th ACM SIGKDD international
conference on Knowledge discovery and data mining, 95–104. ACM.
Booth, J. G., & Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an
automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 61(1), 265-285.
Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and
statistical learning via the alternating direction method of multipliers. Foundations and Trends® in
Machine Learning, 3(1), 1-122.
Bickel, P. J., Götze, F., & van Zwet, W. R. (2012). Resampling fewer than n observations: gains,
losses, and remedies for losses (pp. 267-297). Springer New York.
Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters.
Communications of the ACM, 51(1), 107-113.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1-26.
Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. (2012). The big data bootstrap. arXiv preprint
arXiv:1206.6415.
Khanna, R., Zhang, L., Agarwal, D. and Chen, B. (2012). Parallel Matrix Factorization for Binary
Response. In Arxiv.org.
Zhang, L., Agarwal, D., and Chen, B. (2011). Generalizing matrix factorization through flexible
regression priors. In Proceedings of the fifth ACM conference on Recommender systems, 13–20.
ACM.