Large-scale Recommender Systems on Just a PC
LSRS 2013 keynote (RecSys ’13 Hong Kong)
Aapo Kyrölä, Ph.D. candidate @ CMU
http://www.cs.cmu.edu/~akyrola  Twitter: @kyrpov
Big Data – small machine
Slide 2
My Background
Academic: 5th year Ph.D. @ Carnegie Mellon. Advisors: Guy Blelloch, Carlos Guestrin (UW)
Startup entrepreneur 2009-2012
+ Shotgun: parallel L1-regularized regression solver (ICML 2011).
+ Internships at MSR Asia (2011) and Twitter (2012)
Habbo: founded 2000
Slide 3
Outline of this talk
1. Why single-computer computing?
2. Introduction to graph computation and GraphChi
3. Recommender systems with GraphChi
4. Future directions & conclusion
Slide 4
Why on a single machine? Can't we just use the Cloud?
Slide 5
Why use a cluster? Two reasons:
1. One computer cannot handle my problem in a reasonable time.
2. I need to solve the problem very fast.
Slide 6
Why use a cluster? Two reasons:
1. One computer cannot handle my problem in a reasonable time.
2. I need to solve the problem very fast.
Our work expands the space of feasible (graph) problems on one machine:
- Our experiments use the same graphs as, or bigger graphs than, previous papers on distributed graph computation (and we can process the Twitter graph on a laptop).
- Most data is not that big.
Our work raises the bar on the performance required to justify a complicated distributed system.
Slide 7
Benefits of single-machine systems (assuming one can handle your big problems):
1. Programmer productivity: global state; can use real data for development.
2. Inexpensive to install and administer; less power.
3. Scalability.
Slide 8
Efficient Scaling
[Figure: task timelines comparing a distributed graph system with a single-computer system capable of big tasks. Going from 6 to 12 machines, the distributed system achieves (significantly) less than 2x throughput with 2x machines, while running independent tasks on independent single machines achieves exactly 2x throughput with 2x machines.]
Slide 9
Slide 10
GRAPH COMPUTATION AND GRAPHCHI
Slide 11
Why graphs for recommender systems?
- Graph = matrix: edge(u,v) = M[u,v]. Note: always sparse graphs.
- Intuitive, human-understandable representation: easy to visualize and explain.
- Unifies collaborative filtering (typically matrix based) with recommendation in social networks.
- Random-walk algorithms.
- Local view: vertex-centric computation.
Slide 12
Vertex-Centric Computational Model
- Graph G = (V, E) with directed edges: e = (source, destination).
- Each edge and vertex is associated with a value (user-defined type).
- Vertex and edge values can be modified (structure modification is also supported).
Slide 13
Vertex-centric Programming
"Think like a vertex": popularized by the Pregel and GraphLab projects.
MyFunc(vertex) { // modify neighborhood }
Slide 14
What is GraphChi? (Both in OSDI ’12!)
Slide 15
The Main Challenge of Disk-based Graph Computation: Random Access
~100K reads/sec (commodity disks); ~1M reads/sec (high-end arrays)
Performance: GraphChi can compute on the full Twitter follow-graph with just a standard laptop, roughly as fast as a very large Hadoop cluster! (Size of the graph in Fall 2013: > 20B edges [Gupta et al. 2013])
Slide 19
GraphChi is Open Source
C++ and Java versions on GitHub: http://github.com/graphchi
The Java version has a Hadoop/Pig wrapper, if you really, really want to use Hadoop.
Slide 20
RECSYS MODEL TRAINING WITH GRAPHCHI
Slide 21
Overview of Recommender Systems for GraphChi
- Collaborative Filtering toolkit (next slide)
- Link prediction in large networks
- Random-walk based approaches (Twitter): talk on Wednesday.
Slide 22
GraphChi's Collaborative Filtering Toolkit
Developed by Danny Bickson (CMU / GraphLab Inc). Includes:
- Alternating Least Squares (ALS)
- Sparse-ALS
- SVD++
- LibFM (factorization machines)
- GenSGD
- Item-similarity based methods
- PMF
- CliMF (contributed by Mark Levy)
- ...
Note: in the C++ version; a Java version is in development by a CMU team.
See Danny's blog for more information: http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html
Slide 23
TWO EXAMPLES: ALS AND ITEM-BASED CF
Slide 24
Example: Alternating Least Squares Matrix Factorization (ALS)
Reference: Y. Zhou, D. Wilkinson, R. Schreiber, R. Pan: Large-Scale Parallel Collaborative Filtering for the Netflix Prize (2008)
Task: predict ratings for items (movies) by users.
Model: latent factor model (see next slide).
Slide 25
ALS: Product-Item bipartite graph
[Figure: users connected by rating edges (4, 3, 2, 5) to movies (City of God, Wild Strawberries, The Celebration, La Dolce Vita, Women on the Verge of a Nervous Breakdown); each user and movie vertex carries a latent factor vector.]
A user's rating of a movie is modeled as a dot-product of the user's and the movie's latent factor vectors.
Slide 26
ALS: GraphChi implementation
- The update function handles one vertex at a time (user or movie).
- For each user: estimate latent(user) by minimizing the least-squares error of the dot-product predicted ratings.
- GraphChi executes the update function for each vertex (in parallel) and loads the edges (ratings) from disk.
- Latent factors in memory: needs O(V) memory. If the factors don't fit in memory, they can be replicated to the edges and thus stored on disk.
- Scales to very large problems!
Slide 27
ALS: Performance, Matrix Factorization (Alternating Least Squares)
Remark: Netflix is not a big problem, but GraphChi scales at most linearly with input size (ALS is CPU-bound, so runtime should be sub-linear in the number of ratings).
Slide 28
Example: Item-Based CF
Task: compute a similarity score (e.g., Jaccard) for each movie pair that has at least one viewer in common. Similarity(X, Y) ~ number of common viewers.
Output the top-K similar items for each item to a file, or create an edge between X and Y containing the similarity.
Problem: enumerating all pairs takes too much time.
Slide 29
[Figure: bipartite graph of movies (City of God, Wild Strawberries, The Celebration, La Dolce Vita, Women on the Verge of a Nervous Breakdown) and their viewers; a movie pair with 3 common viewers is highlighted.]
Solution: enumerate all triangles of the graph.
New problem: how to enumerate triangles if the graph does not fit in RAM?
PIVOTS Algorithm:
1. Let the pivots be a subset of the vertices; load all neighbor lists (adjacency lists) of the pivots into RAM.
2. Use GraphChi to load all vertices from disk, one by one, and compare their adjacency lists to the pivots' adjacency lists (similar to a merge).
3. Repeat with a new subset of pivots.
Slide 32
Triangle Counting: Performance
Slide 33
FUTURE DIRECTIONS & FINAL REMARKS
Slide 34
Single-Machine Computing in Production?
GraphChi supports incremental computation with dynamic graphs: it can keep running indefinitely, adding new edges to the graph, so the model stays constantly fresh. However, this requires engineering that is not included in the toolkit. Compare to a cluster-based system (such as Hadoop) that needs to recompute from scratch.
Slide 35
Efficient Scaling
Businesses need to compute hundreds of distinct tasks on the same graph (example: personalized recommendations). Parallelize each task? Parallelize across tasks.
Slide 36
Single Machine vs. Cluster
Most Big Data computations are I/O-bound.
- Single machine: disk bandwidth + seek latency. Distributed memory: network bandwidth + network latency.
- Complexity/challenges: on a single machine, algorithms and data structures that reduce random access; distributed, administration, coordination, consistency, fault tolerance.
- Total cost; programmer productivity; specialized vs. generalized frameworks.
Slide 37
Unified RecSys Platform for GraphChi?
Working with master's students at CMU. Goal: the ability to easily compare different algorithms and parameters.
- Unified input and output; a general programmable API (not just file-based).
- Evaluation process: several evaluation metrics; cross-validation, held-out data.
- Run many algorithm instances in parallel, on the same graph.
- Java. Scalable from the get-go.
Slide 38
Slide 39
Slide 40
Recent developments: Disk-based Graph Computation
Two disk-based graph computation systems were recently published: TurboGraph (KDD ’13) and X-Stream (SOSP ’13, in October). They show significantly better performance than GraphChi on many problems and avoid preprocessing (sharding). But GraphChi can do some computations that X-Stream cannot (triangle counting and related), and TurboGraph requires an SSD. Hot research area!
Slide 41
Do you need GraphChi, or any system at all? Heck, for many algorithms you can just mmap() over your (binary) adjacency list / sparse matrix and write a for-loop. See Lin, Chau, Kang: "Leveraging Memory Mapping for Fast and Scalable Graph Computation on a PC" (Big Data ’13). Obviously it is good to have a common API, and some algorithms need more advanced solutions (like GraphChi, X-Stream, TurboGraph). Beware of the hype!
Slide 42
Conclusion
Very large recommender algorithms can now be run on just your PC or laptop.
- Additional performance comes from multi-core parallelism.
- Great for productivity: scale by replicating.
- In general, good single-machine scalability requires care with data structures and memory management: natural with C/C++; with Java (etc.) it needs low-level byte massaging. Frameworks like GraphChi hide the low-level details.
- More work is needed to productize the current work.