A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

DESCRIPTION

This talk was given by Bhaskar Ghosh (Senior Director of Engineering, LinkedIn Data Infrastructure), at the Yale Oct 2012 Symposium on Big Data, in honor of Martin Schultz.

Page 1: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Bhaskar Ghosh, Senior Director of Engineering, Data Infrastructure

LinkedIn Confidential ©2013 All Rights Reserved

Big Data Science: A Symposium in Honor of Martin Schultz, Yale University, 26 Oct 2012

Page 2: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Outline

1. Martin and Me
2. Company and Mission
3. Products and Science
4. Data Infrastructure
5. P, S, DI: People You May Know
6. LinkedIn + Yale
7. Conclusion

Page 3: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Martin and Me

Thank you, Martin! Best mentor: versatility, big-picture thinking, and leadership.

Yale CS Ph.D. 1995 (Parallel Algorithms)

12y @ Informix & Oracle building parallel database systems

4y @ Yahoo! building Ads systems & leading the Display Ads Exchange organization

2y+ @ LinkedIn building & leading the Data Infrastructure Engineering Organization

Page 4: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

The World’s Largest Professional Network

175M+ Members Worldwide

2 new Members Per Second

100M+ Monthly Unique Visitors

2M+ Company Pages

Connecting Talent with Opportunity. At scale…

Page 5: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

…and a bunch of Data-Driven Products

Pandora Search for People

Events You May Be Interested In

Groups browse maps

Page 6: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

The LinkedIn Mission. Connect the world’s professionals to make them more productive and successful

Page 7: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

LinkedIn Product Philosophy

Goals

• Provide a uniquely personalized experience to members (professionals)
• Build an ecosystem to balance the interests of members and partners (companies)

Approach

• Launch Often and Early
• Data-Driven Experiment and Test
• Fail Fast
• Prepare for Virality and Scale

Page 8: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Two Product Families

For Members: People You May Know, Who’s Viewed My Profile, Jobs You May Be Interested In, News/Sharing Today, Search, Subscriptions

For Partners: Hire, Market, Sell

Supporting layers: Science and Analytics, Data Infrastructure, Data

Data: Professionals, Companies, Connections, Profiles, Actions, Content

Page 9: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

The Big-Data Feedback Loop

[Diagram: a feedback loop connecting Member, Product, Data, Science, and Infrastructure. Labels on the loop: Engagement ↑, Virality ↑, Signals ↑, Scale ↑, Analytics ↑, Insights ↑, Refinement ↑, Value ↑]

Page 10: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Member-Facing Products: Diversity at Scale

Product Family: Products

• Identity and Engagement: 1. Profile and Connections 2. Activity Streams 3. Messages (email) 4. Endorsements & Skills
• Search and Analysis: 1. People Search 2. Group Search 3. Who Viewed My Profile
• Recommendations: 1. People You May Know 2. Jobs You May Be Interested In 3. Events You May Be Interested In
• Monetization: 1. Subscription Packages 2. Sponsored Content

Science

• Blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings)
• Entity disambiguation and matching
• Response Prediction
• Inventory Forecasting

Data Infra

Page 11: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Recommendations… Are Effective… and Drive

• > 50% of connections
• > 50% of job applications
• > 50% of group joins

Guiding Principle

• Find data that is useful for Members
• Provide Relevant Content
• Establish Social Connections
• In Appropriate Context

Page 12: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

LinkedIn Recommendation Engine

[Diagram: layered view of the recommendation engine; the recoverable labels are listed below]

Recommendation Types: Behavior Analysis, Collaborative Filtering, Popularity

Products, by Recommendation Entity

• People: Similar Profiles, Referral Center, TalentMatch, People Browse Map
• Jobs: Jobs Browse Map, Similar Jobs, Jobs You May Be Interested In
• Groups: GYML, Groups Browse Map, Similar Groups
• … Ads, Companies, Searches, News, Events … and more

Shared, Dynamic, Unified Core Service

• User Feedback
• API
• (R-T) Feature Extraction, Entity Resolution & Enrichment
• (R-T) matching computations
• A/B
• Offline data munging (Hadoop)

Page 13: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Member-Facing Products: Diversity at Scale

Product Family: Products

• Identity and Engagement: 1. Profile and Connections 2. Activity Streams 3. Messages (email) 4. Endorsements & Skills
• Search and Analysis: 1. People Search 2. Group Search 3. Who Viewed My Profile
• Recommendations: 1. People You May Know 2. Jobs You May Be Interested In 3. Events You May Be Interested In
• Monetization: 1. Subscription Packages 2. Sponsored Content

Science

• Blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings)
• Entity disambiguation and matching
• Response prediction

Data Infra

• Scale; Full text and secondary ind; Real-time
• Faceted search; Near RT index freshness; Drill-down exploration
• Graph analysis; Content serving; Real-time tuning

Page 14: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

LinkedIn Data Infrastructure: Three-Phase Abstraction

[Diagram: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra]

Infrastructure | Latency & Freshness Requirements | Products
Online | Activity that should be reflected immediately | Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills
Near-Line | Activity that should be reflected soon | Activity Streams, Profile Standardization, News, Recommendations, Search, Messages
Offline | Activity that can be reflected later | People You May Know, Connection Strength, News, Recommendations, Next best idea…

Page 15: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

LinkedIn Data Infrastructure: Sample Stack

Infra challenges in 3-phase ecosystem are diverse, complex and specific

Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms

Page 16: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

LinkedIn Data Infrastructure: Data Stores

Systems

• Voldemort

Capabilities

• Transactions
• Rich structures (e.g. indexes)
• Change capture capability
• Key value / document storage

ICDE 2012 (Data Infra Overview)

FAST 2012 (Voldemort for Serving)

Page 17: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

LinkedIn Data Infrastructure: Specialized Indexes

Systems | Capabilities
Zoie, Bobo, Sensei | Search platform
GraphDB | Distributed graph engine

Page 18: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

LinkedIn Data Infrastructure: Pipelines

Systems | Capabilities
Kafka | Messaging for site events, monitoring; high throughput
Databus | Change data capture stream; reliable, consistent, low-latency pipe

ACM SOCC 2012: “Databus”

IEEE Data Eng. Bulletin 2012: “Kafka”

Page 19: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

LinkedIn Data Infrastructure: Off-line Analysis

Systems Capabilities

• ML, Ranking, Relevance
• Insights and Analytics
• ETL, Metadata and Pipes
• Business Source of Truth

Page 20: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

LinkedIn Data Infrastructure: Cluster Management

Systems Capabilities

Generic framework for building distributed systems

Cluster Management Primitives

ACM SOCC 2012: Untangling Cluster Management with Helix

Page 21: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

HELIX: Generalizing Cluster Management

[Diagram: STATE MACHINE with states O, S, and M and transitions t1, t2, t3, t4]

CONSTRAINTS | OBJECTIVE
S: COUNT=2 | $\mathrm{minimize}\left(\max_{n_j \in N} S(n_j)\right)$
M: COUNT=1 | $\mathrm{minimize}\left(\max_{n_j \in N} M(n_j)\right)$
t1 ≤ 5 |

Helix

• Declare distributed system behavior via {S, C, O}
• Enforce Partition constraints
• Fault detection and tolerance (e.g. promote S to M)
• Elasticity (e.g. Re-balance; Minimize migrations)

Used in Espresso, Search, Databus
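The declarative idea on this slide (state machine, constraints, objective) can be illustrated with a toy placement routine. This is only a sketch and not the Helix API: the names `CONSTRAINTS`, `assign`, and `fail_node` are hypothetical, M and S are assumed to mean master and slave, and the greedy strategy is one simple way to keep the per-node counts M(n_j) and S(n_j) balanced.

```python
from collections import Counter

# Hypothetical replica-count constraints per partition, mirroring the slide:
# one master (M) and two slaves (S).
CONSTRAINTS = {"M": 1, "S": 2}


def assign(partitions, nodes, constraints=CONSTRAINTS):
    """Greedily place replicas so the per-node count of each state stays balanced."""
    load = {state: Counter({n: 0 for n in nodes}) for state in constraints}
    placement = {}
    for p in partitions:
        placement[p] = {}
        for state, count in constraints.items():
            for _ in range(count):
                # A node holds at most one replica of a given partition.
                free = [n for n in nodes if n not in placement[p]]
                # Objective: keep the maximum over nodes of M(n_j) or S(n_j) small.
                node = min(free, key=lambda n: (load[state][n], n))
                placement[p][node] = state
                load[state][node] += 1
    return placement


def fail_node(placement, dead):
    """Fault tolerance: drop the dead node and promote one slave if a master was lost."""
    for replicas in placement.values():
        if replicas.pop(dead, None) == "M":
            survivor = min(n for n, s in replicas.items() if s == "S")
            replicas[survivor] = "M"
    return placement


placement = assign(["p0", "p1", "p2", "p3"], ["n1", "n2", "n3"])
```

Calling `fail_node(placement, "n1")` afterwards leaves every partition with exactly one master, which is the "promote S to M" behavior in miniature. A real system would also re-create the lost replicas elsewhere while minimizing migrations.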

Page 22: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

LinkedIn Data Infrastructure: A few take-aways

1. Infrastructure decisions matter and are hard to transform in a hyper-growth environment.

2. Balance open-source products with home-grown platforms (**)

3. Operability, Capacity Planning and On-line Multi-tenancy are hard

4. Data Movement: Pipes and Feedback Loops are critical (**)

5. Data Model and Integration e2e are key (*)

6. Few vs Many: Balance over-specialized (agile) vs generic (leverage-able) platform efforts (*)

7. Off-line Multi-Platform story is evolving.

Page 23: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Science and Infrastructure: Giving Back

Research Publications

ACM SOCC 2012, ACM RecSys 2012, SIGIR 2012, CIKM 2012, VLDB 2012, ICDE 2012, FAST 2012, NetDB 2011, …

Open Source Projects

• Apache Helix (new)
• ParSeq (new)
• DataFu (new)
• Apache Kafka
• Sensei
• Azkaban
• Voldemort

Page 24: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

A Recommendation Product: People You May Know (PYMK)

Page 25: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Probability that you may know someone else?

[Diagram: Alice is connected to both Bob and Carol. Do Bob and Carol know each other (??)]

Known as “triangle closing”
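A minimal sketch of “triangle closing”, assuming the graph is held in memory as a dictionary of connection sets. The function name `triangle_closing_candidates` and the toy graph are illustrative and not taken from the talk. For each member, every pair of their connections that is not already connected becomes a candidate, scored by the number of common connections.

```python
from collections import Counter
from itertools import combinations


def triangle_closing_candidates(connections):
    """Count common connections for every pair of members who are not yet connected."""
    common = Counter()
    for member, friends in connections.items():
        # Every pair of this member's connections shares `member` as a common connection.
        for a, b in combinations(sorted(friends), 2):
            if b not in connections.get(a, set()):
                common[(a, b)] += 1
    return common


graph = {"Alice": {"Bob", "Carol"}, "Bob": {"Alice"}, "Carol": {"Alice"}}
candidates = triangle_closing_candidates(graph)
```

Here `candidates` holds the single open triangle, the pair (Bob, Carol) with one common connection. Enumerating all pairs per member is what makes the problem blow up at the scale of a real network.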

Page 26: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

PYMK: Science, Members and Connections

1) Feature selection is key: Common Connections, Geo, Company, Age
2) ML and data model
   • Traditional ML (e.g. matrix factorization) on O(n^2) pairs of 175M members tends not to scale easily
3) Interplay: Data Model + ML + Parallel Computation model
4) Adding edges: Why do it?
   • Creates positive-feedback social loops for members
   • More useful content and activity available to members
   • Denser graph improves signal strength in science-driven products

[Diagram: The Feedback Loop connecting Member, Product, Data, and Science. Labels: Virality ↑, Signals ↑, Insights ↑, Value ↑]

Page 27: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

PYMK: Off-line Model Build

• Use generic off-line infra (Hadoop and Pig) to build recommendations off-line.
• Very complex workflow due to extraction and selection of a large number of features. Built Azkaban for Hadoop.
• Small input and final look-up structure, but large intermediate data (100s of TB) due to MR.
• The problem (who you do not know) itself has an inherent blow-up.
• Special optimizations (e.g. Bloom Join to remove already-connected pairs)
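The Bloom-filter optimization mentioned here can be sketched in a few lines. This is an assumption-laden illustration, not LinkedIn's Hadoop job: the `BloomFilter` class, `pair_key`, and `drop_connected` are hypothetical names, and the filter sizes are arbitrary. The idea is to build a compact filter over existing connections and use it to discard candidate pairs early, before any expensive join.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, a tunable rate of false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def pair_key(a, b):
    """Order-independent key for an undirected connection."""
    return "|".join(sorted((a, b)))


def drop_connected(candidates, connected_pairs):
    """Remove candidate pairs that the filter reports as already connected."""
    bloom = BloomFilter()
    for a, b in connected_pairs:
        bloom.add(pair_key(a, b))
    return [(a, b) for a, b in candidates if pair_key(a, b) not in bloom]


remaining = drop_connected(
    [("Bob", "Alice"), ("Bob", "Carol")],
    [("Alice", "Bob"), ("Alice", "Carol")],
)
```

Because a Bloom filter never yields false negatives, every connected pair is removed. The price is that a small fraction of unconnected candidates may be dropped as false positives, which is usually acceptable for a recommendation list.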

Page 28: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

PYMK: Off-line to Near-Line Serving

• Build serving structure on Hadoop. Scan versus index compactness tradeoff.
• Voldemort: Partitioned k-v; Load-balancing; Pluggable storage layer; Failover.
• Bulk load for efficiency. Fast Rollback for safety. Atomic swap.
• Serving: Per-partition index in memory. PYMK blobs on disk. Retrieval ~msec. Decoration in App FE is more expensive.
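The bulk load, atomic swap, and rollback pattern can be sketched with an in-memory stand-in. This is not Voldemort's API: the `ReadOnlyStore` class and its methods are hypothetical, and real deployments swap on-disk files rather than Python dictionaries. The point is that a new version is built completely off to the side and goes live in one reference swap, with the previous version kept for a fast rollback.

```python
import zlib


class ReadOnlyStore:
    """Toy partitioned key-value store with versioned bulk loads."""

    def __init__(self, num_partitions=4):
        self.num_partitions = num_partitions
        self._current = self._empty()
        self._previous = None

    def _empty(self):
        return [dict() for _ in range(self.num_partitions)]

    def _partition(self, key):
        # Deterministic hash partitioning of string keys.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def bulk_load(self, records):
        """Build the new version fully, then make it live in a single swap."""
        staged = self._empty()
        for key, value in records.items():
            staged[self._partition(key)][key] = value
        self._previous, self._current = self._current, staged

    def rollback(self):
        """Return to the version that was live before the last bulk load."""
        if self._previous is None:
            raise RuntimeError("no earlier version to roll back to")
        self._current, self._previous = self._previous, None

    def get(self, key):
        return self._current[self._partition(key)].get(key)


store = ReadOnlyStore()
store.bulk_load({"bob": ["carol", "dave"]})
store.bulk_load({"bob": ["eve"]})
store.rollback()
```

After the rollback, `store.get("bob")` serves the first load again. Readers never observe a half-built version because lookups only ever touch `_current`.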

Page 29: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

PYMK: Science and Feedback Loop

• Response vs Latency: Fast refresh helps user experience (e.g. showing connections of very recent connections). “Social” phenomenon.
• Very agile feature: Lots of on-line A/B testing and tweaking of features
• Huge Impact: > 50% of accepted invites are created by PYMK
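On-line A/B testing needs a stable way to split members between variants. One common approach, shown here as a sketch and not as LinkedIn's experimentation platform, is to hash the member id together with an experiment name. The function `ab_bucket` and the experiment name `pymk-ranker-v2` are hypothetical.

```python
import hashlib


def ab_bucket(member_id, experiment, treatment_share=0.5):
    """Deterministically assign a member to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode()).digest()
    # Map the hash onto 10,000 slots; the first share of slots gets the treatment.
    slot = int.from_bytes(digest[:8], "big") % 10_000
    return "treatment" if slot < treatment_share * 10_000 else "control"


variant = ab_bucket(42, "pymk-ranker-v2")
```

The same member always lands in the same bucket for a given experiment, and including the experiment name in the hash keeps assignments independent across experiments.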

Page 30: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

PYMK: Tying It All Together

P(B knows C) ∝ large number of features

• Distance
• Common connections
• Organizational Overlap
• Age

[Diagram: member graph with Bob, Alice, Carol, Dave, and Eve; pipeline boxes: Offline Model (Offline), Near-Line Serving (Near-Line), PYMK Application, User Interactions]
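How a handful of features might be combined into P(B knows C) can be sketched with a logistic model. The talk does not say which model PYMK uses, so the logistic form, the feature names, and every weight below are assumptions chosen for illustration only.

```python
import math

# Hypothetical weights; a real model would learn these from accepted invitations.
WEIGHTS = {
    "common_connections": 0.8,
    "same_company": 1.2,
    "graph_distance": -0.9,
    "age_gap_years": -0.05,
}
BIAS = -2.0


def score(features):
    """Logistic estimate of the probability that two members know each other."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank(candidates, top_k=2):
    """Return the names of the top_k candidates, highest score first."""
    return sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)[:top_k]


# Candidate connections for Bob, with made-up feature values.
candidates_for_bob = {
    "Carol": {"common_connections": 3, "same_company": 1, "graph_distance": 2, "age_gap_years": 1},
    "Dave": {"common_connections": 1, "same_company": 0, "graph_distance": 2, "age_gap_years": 5},
    "Eve": {"common_connections": 0, "same_company": 0, "graph_distance": 3, "age_gap_years": 10},
}
top = rank(candidates_for_bob)
```

With these made-up numbers Carol ranks first and Dave second. In the offline phase such scores would be computed in bulk, and only the ranked lists would be pushed to near-line serving.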

Page 31: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

LinkedIn + Yale

Students

• What is my career path?
• How can I prepare?
• How do I get my first internship and first job?

Yale

• Where did my students go after they left the university?
• How is my school seeding the various industries with the best talent?
• How does my school compare with other institutions?

Wins based on data and insights

• Students: Transformation of Careers
• Yale: Get a data-driven view. Uncover opportunities

Page 32: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Thank you colleagues for the beautiful slides!

David Henke, SVP Operations

Amy Tang, Sr. Program Manager

Sam Shah, Principal Engineer

Shirshanka Das, Principal Engineer

Kapil Surlaker, Principal Engineer

Anmol Bhasin, Sr. Engineering Manager

Daniel Tunkelang, Principal Data Scientist

Page 33: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Summary

1. E2E: The Big-Data feedback loop of social-network product design is cool
2. Infrastructure
   1. Data Infrastructure needs continuous innovation and iteration to keep pace for scale and cost.
   2. Fast moving, Big, Clean Data + Agile Metadata = Goodness
   3. Data-driven products need agile feedback infrastructure and measurement methodology.
3. Methodology
   1. Data-Driven experimentation enables insights and agile products
   2. Recommendation-driven products have big impact.

Read more @ data.linkedin.com

Page 34: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Help us. Come Have Fun with Us!

Info: data.linkedin.com

1. Science and Data Mining: Recommendation and Optimization Problems
2. Next-generation ad-hoc and OLAP query processing on Hadoop
3. Graph Computations: Off-line mining and On-line integration loops
4. nRT Data Streams in Near-line infrastructure
5. And much more…

Page 35: A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

In Closing

[email protected]

Thank You!
