RedditR: Your Personalized Gateway to Reddit.com
Aravind Kumar Ramesh, Insight Data Engineering Fellow, New York

Insight Data Engineering Project

Motivation

In 2015:

82.54 billion pageviews
73.15 million submissions
725.85 million comments

What’s trending? Maximize content engagement

Personalized recommendation

DEMO.

1,034,259.8 MB of Reddit data

Data Pipeline

Challenges

◉ Restricted Reddit API

Solutions

◉ Restricted Reddit API

- A multi-threaded API to scrape Reddit
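
The multi-threaded scraping idea can be sketched as follows. This is a minimal illustration rather than the project's actual scraper: `scrape_subreddits`, `fetch_page`, and `fake_fetch` are hypothetical names, and the real HTTP call (with whatever rate limiting the Reddit API requires) is assumed to live inside the `fetch_page` callable supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List


def scrape_subreddits(
    subreddits: Iterable[str],
    fetch_page: Callable[[str], List[dict]],
    max_workers: int = 8,
) -> Dict[str, List[dict]]:
    """Run one fetch per subreddit on a pool of worker threads.

    fetch_page is a hypothetical callable that performs a single request
    and returns the posts for the given subreddit.
    """
    results: Dict[str, List[dict]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the subreddit it is fetching.
        futures = {pool.submit(fetch_page, name): name for name in subreddits}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception:
                # One failed request should not abort the other workers.
                results[name] = []
    return results


def fake_fetch(name: str) -> List[dict]:
    # Stand-in for a real HTTP request; returns one dummy post.
    return [{"subreddit": name, "title": f"post in {name}"}]


posts = scrape_subreddits(["python", "datascience"], fake_fetch, max_workers=2)
```

Threads suit this workload because each request spends most of its time waiting on the network, so several can be in flight at once while staying under the API's limits.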

Challenges

◉ Generating recommendations using ALS

Solutions

◉ Generating recommendations using ALS

- ALS is compute intensive.

- Generating recommendations using the user graph

Challenges

◉ Dealing with large data

Solution

Use Parquet

Original dataset: 1084.5 GB
Compressed Parquet: 187.8 GB

Queries ran 3x faster on Parquet.

Table Design

PRIMARY KEY (author, created_utc)) WITH CLUSTERING ORDER BY (created_utc ASC)

Secondary Index

CREATE INDEX subreddit ON subredditinfo (subreddit);

I am Aravind. I am here because I love data engineering and working with large-scale data. You can find me @aravindk1992

About Me

Bachelor’s in Telecommunication Engineering
Master’s in Computer Science from the State University of New York at Buffalo, New York

Any questions?

Thanks!

Backup slides!

User Graph

USER A (POSTS CONTENT ON REDDIT)

User Graph

USER B (READS THE POST AND REPLIES TO IT)

USER A

User Graph

USER B → USER A (INTERACTION)

Indegree: Influence

Outdegree: Activity
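
A minimal sketch of how the two degree measures could be computed from reply interactions. It assumes each interaction is a directed edge from the replying user to the author of the post, as in the User A / User B example; the function name `degree_counts` and the sample edges are illustrative, not taken from the project.

```python
from collections import Counter
from typing import Iterable, Tuple


def degree_counts(interactions: Iterable[Tuple[str, str]]) -> Tuple[Counter, Counter]:
    """Count indegree and outdegree over (replier, author) edges."""
    indegree: Counter = Counter()   # replies received: a proxy for influence
    outdegree: Counter = Counter()  # replies written: a proxy for activity
    for replier, author in interactions:
        outdegree[replier] += 1
        indegree[author] += 1
    return indegree, outdegree


# Hypothetical sample: user_b and user_c reply to user_a, user_a replies to user_c.
edges = [("user_b", "user_a"), ("user_c", "user_a"), ("user_a", "user_c")]
indegree, outdegree = degree_counts(edges)
```

Here `user_a` receives two replies and writes one, so it ends up with the highest indegree in the sample.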

What are you most likely to like?

◉ Look at the indegree of all the nodes in a cluster/subreddit and rank them.

◉ For the top 10 nodes with the highest indegree, compute their outdegree to other clusters.

◉ You are more likely to like what the most influential users of your favourite subreddit engage with.
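
The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: interactions are assumed to be `(replier, author, subreddit)` tuples, and `recommend_subreddits`, its parameters, and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Interaction = Tuple[str, str, str]  # (replier, author, subreddit)


def recommend_subreddits(
    interactions: Iterable[Interaction],
    favourite: str,
    top_users: int = 10,
    top_n: int = 5,
) -> List[str]:
    """Rank other subreddits by how much the favourite's top users engage there."""
    interactions = list(interactions)
    # Step 1: indegree (replies received) of each user inside the favourite subreddit.
    indegree = Counter(author for _, author, sub in interactions if sub == favourite)
    # Step 2: keep the users with the highest indegree.
    influencers = {user for user, _ in indegree.most_common(top_users)}
    # Step 3: count those users' outgoing interactions in every other subreddit.
    outdegree = Counter(
        sub
        for replier, _, sub in interactions
        if replier in influencers and sub != favourite
    )
    return [sub for sub, _ in outdegree.most_common(top_n)]


sample = [
    ("bob", "alice", "python"),
    ("carol", "alice", "python"),
    ("alice", "dave", "datascience"),
    ("alice", "erin", "datascience"),
    ("alice", "frank", "scala"),
    ("bob", "gina", "golang"),
]
recommendations = recommend_subreddits(sample, "python", top_users=1)
```

In the sample, `alice` is the only user receiving replies in `python`, so only her activity elsewhere is counted; `bob`'s reply in `golang` does not contribute.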