Large-scale Recommender Systems on Just a PC
LSRS 2013 keynote (RecSys ’13 Hong Kong)
Aapo Kyrölä, Ph.D. candidate @ CMU
http://www.cs.cmu.edu/~akyrola  Twitter: @kyrpov
Big Data – small machine
Slide 2
My Background
Academic: 5th year Ph.D. @ Carnegie Mellon. Advisors: Guy Blelloch, Carlos Guestrin (UW)
Startup entrepreneur 2009-2012
+ Shotgun: parallel L1-regularized regression solver (ICML 2011).
+ Internships at MSR Asia (2011) and Twitter (2012)
Habbo: founded 2000
Slide 3
Outline of this talk
1. Why single-computer computing?
2. Introduction to graph computation and GraphChi
3. Recommender systems with GraphChi
4. Future directions & conclusion
Slide 4
Why on a single machine? Can't we just use the Cloud?
Slide 5
Why use a cluster? Two reasons:
1. One computer cannot handle my problem in a reasonable time.
2. I need to solve the problem very fast.
Slide 6
Why use a cluster? Two reasons:
1. One computer cannot handle my problem in a reasonable time.
2. I need to solve the problem very fast.
Our work expands the space of feasible (graph) problems on one machine:
- Our experiments use the same graphs as, or bigger graphs than, previous papers on distributed graph computation (and we can process the Twitter graph on a laptop).
- Most data is not that big.
Our work raises the bar on the performance required to justify a complicated distributed system.
Slide 7
Benefits of single-machine systems (assuming one can handle your big problems):
1. Programmer productivity: global state; can use real data for development.
2. Inexpensive to install and administer; less power.
3. Scalability.
Slide 8
Efficient Scaling
[Figure: task timelines comparing a distributed graph system with a single-computer system capable of big tasks. Going from 6 to 12 machines, the distributed system achieves (significantly) less than 2x throughput with 2x machines, while running independent tasks on independent single machines achieves exactly 2x throughput with 2x machines.]
Slide 9
Slide 10
GRAPH COMPUTATION AND GRAPHCHI
Slide 11
Why graphs for recommender systems?
- Graph = matrix: edge(u,v) = M[u,v]. Note: always sparse graphs.
- Intuitive, human-understandable representation: easy to visualize and explain.
- Unifies collaborative filtering (typically matrix based) with recommendation in social networks.
- Random-walk algorithms.
- Local view: vertex-centric computation.
Slide 12
Vertex-Centric Computational Model
- Graph G = (V, E) with directed edges: e = (source, destination).
- Each edge and vertex is associated with a value (user-defined type).
- Vertex and edge values can be modified (structure modification is also supported).
Slide 13
Vertex-centric Programming
"Think like a vertex": popularized by the Pregel and GraphLab projects.
MyFunc(vertex) { // modify neighborhood }
Slide 14
What is GraphChi? (Both in OSDI ’12!)
Slide 15
The Main Challenge of Disk-based Graph Computation: Random Access
~100K reads/sec (commodity disks); ~1M reads/sec (high-end arrays)
Performance: GraphChi can compute on the full Twitter follow-graph with just a standard laptop, roughly as fast as a very large Hadoop cluster! (Size of the graph in Fall 2013: > 20B edges [Gupta et al. 2013])
Slide 19
GraphChi is Open Source
C++ and Java versions on GitHub: http://github.com/graphchi
The Java version has a Hadoop/Pig wrapper, if you really, really want to use Hadoop.
Slide 20
RECSYS MODEL TRAINING WITH GRAPHCHI
Slide 21
Overview of Recommender Systems for GraphChi
- Collaborative Filtering toolkit (next slide)
- Link prediction in large networks
- Random-walk based approaches (Twitter): talk on Wednesday.
Slide 22
GraphChi's Collaborative Filtering Toolkit
Developed by Danny Bickson (CMU / GraphLab Inc). Includes:
- Alternating Least Squares (ALS)
- Sparse-ALS
- SVD++
- LibFM (factorization machines)
- GenSGD
- Item-similarity based methods
- PMF
- CliMF (contributed by Mark Levy)
- ...
Note: in the C++ version; a Java version is in development by a CMU team.
See Danny's blog for more information: http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html
Slide 23
TWO EXAMPLES: ALS AND ITEM-BASED CF
Slide 24
Example: Alternating Least Squares Matrix Factorization (ALS)
Reference: Y. Zhou, D. Wilkinson, R. Schreiber, R. Pan: Large-Scale Parallel Collaborative Filtering for the Netflix Prize (2008)
Task: predict ratings for items (movies) by users.
Model: latent factor model (see next slide).
Slide 25
ALS: Product-Item bipartite graph
[Figure: users connected by rating edges (4, 3, 2, 5) to movies (City of God, Wild Strawberries, The Celebration, La Dolce Vita, Women on the Verge of a Nervous Breakdown); each user and movie vertex carries a latent factor vector.]
A user's rating of a movie is modeled as a dot-product of the user's and the movie's latent factor vectors.
Slide 26
ALS: GraphChi implementation
- The update function handles one vertex at a time (user or movie).
- For each user: estimate latent(user) by minimizing the least-squares error of the dot-product predicted ratings.
- GraphChi executes the update function for each vertex (in parallel) and loads the edges (ratings) from disk.
- Latent factors in memory: needs O(V) memory. If the factors don't fit in memory, they can be replicated to the edges and thus stored on disk.
- Scales to very large problems!
Slide 27
ALS: Performance, Matrix Factorization (Alternating Least Squares)
Remark: Netflix is not a big problem, but GraphChi scales at most linearly with input size (ALS is CPU-bound, so runtime should be sub-linear in the number of ratings).
Slide 28
Example: Item-Based CF
Task: compute a similarity score (e.g., Jaccard) for each movie pair that has at least one viewer in common. Similarity(X, Y) ~ number of common viewers.
Output the top-K similar items for each item to a file, or create an edge between X and Y containing the similarity.
Problem: enumerating all pairs takes too much time.
Slide 29
[Figure: bipartite graph of movies (City of God, Wild Strawberries, The Celebration, La Dolce Vita, Women on the Verge of a Nervous Breakdown) and their viewers; a movie pair with 3 common viewers is highlighted.]
Solution: enumerate all triangles of the graph.
New problem: how to enumerate triangles if the graph does not fit in RAM?
PIVOTS Algorithm:
1. Let the pivots be a subset of the vertices; load all neighbor lists (adjacency lists) of the pivots into RAM.
2. Use GraphChi to load all vertices from disk, one by one, and compare their adjacency lists to the pivots' adjacency lists (similar to a merge).
3. Repeat with a new subset of pivots.
Slide 32
Triangle Counting: Performance
Slide 33
FUTURE DIRECTIONS & FINAL REMARKS
Slide 34
Single-Machine Computing in Production?
GraphChi supports incremental computation with dynamic graphs: it can keep running indefinitely, adding new edges to the graph, so the model stays constantly fresh. However, this requires engineering that is not included in the toolkit. Compare to a cluster-based system (such as Hadoop) that needs to recompute from scratch.
Slide 35
Efficient Scaling
Businesses need to compute hundreds of distinct tasks on the same graph (example: personalized recommendations). Parallelize each task? Parallelize across tasks.
Slide 36
Single Machine vs. Cluster
Most Big Data computations are I/O-bound.
- Single machine: disk bandwidth + seek latency. Distributed memory: network bandwidth + network latency.
- Complexity/challenges: on a single machine, algorithms and data structures that reduce random access; distributed, administration, coordination, consistency, fault tolerance.
- Total cost; programmer productivity; specialized vs. generalized frameworks.
Slide 37
Unified RecSys Platform for GraphChi?
Working with master's students at CMU. Goal: the ability to easily compare different algorithms and parameters.
- Unified input and output; a general programmable API (not just file-based).
- Evaluation process: several evaluation metrics; cross-validation, held-out data.
- Run many algorithm instances in parallel, on the same graph.
- Java. Scalable from the get-go.
Slide 38
Slide 39
Slide 40
Recent developments: Disk-based Graph Computation
Two disk-based graph computation systems were recently published: TurboGraph (KDD ’13) and X-Stream (SOSP ’13, in October). They show significantly better performance than GraphChi on many problems and avoid preprocessing (sharding). But GraphChi can do some computations that X-Stream cannot (triangle counting and related), and TurboGraph requires an SSD. Hot research area!
Slide 41
Do you need GraphChi, or any system at all? Heck, for many algorithms you can just mmap() over your (binary) adjacency list / sparse matrix and write a for-loop. See Lin, Chau, Kang: "Leveraging Memory Mapping for Fast and Scalable Graph Computation on a PC" (Big Data ’13). Obviously it is good to have a common API, and some algorithms need more advanced solutions (like GraphChi, X-Stream, TurboGraph). Beware of the hype!
Slide 42
Conclusion
Very large recommender algorithms can now be run on just your PC or laptop.
- Additional performance comes from multi-core parallelism.
- Great for productivity: scale by replicating.
- In general, good single-machine scalability requires care with data structures and memory management: natural with C/C++; with Java (etc.) it needs low-level byte massaging. Frameworks like GraphChi hide the low-level details.
- More work is needed to productize the current work.