Data Infrastructure at LinkedIn

Preview:

DESCRIPTION

This talk was given by Jun Rao (Staff Software Engineer at LinkedIn) and Sam Shah (Senior Engineering Manager at LinkedIn) at the Analytics@Webscale Technical Conference (June 2013).

Citation preview

LinkedIn Confidential ©2013 All Rights Reserved

Data Infrastructure at Linkedin

Jun Rao and Sam Shah

LinkedIn Confidential ©2013 All Rights Reserved 2

Outline

1. LinkedIn introduction

2. Online/nearline infrastructure overview

3. Infrastructure for data mining

4. Conclusion

The World’s Largest Professional Network

Members Worldwide

2 newMembers Per Second

100M+Monthly Unique Visitors

200M+ 2M+ Company Pages

Connecting Talent Opportunity. At scale…

LinkedIn Confidential ©2013 All Rights Reserved 3

4

Member ProfilesLarge dataset

Medium writes

Very high reads

Freshness <1s

5

People You May KnowLarge dataset

Compute intensive

High reads

Freshness ~hrs

6

LinkedIn Today Moving dataset

High writes

High reads

Freshness ~mins

LinkedIn Confidential ©2013 All Rights Reserved 7

The Big-Data Feedback Loop

Value

Insights

Scale

Product

ScienceData

Member

Engagement

Virality

Signals

Refinement

InfrastructureAnalytics

LinkedIn Confidential ©2013 All Rights Reserved 8

LinkedIn Data Infrastructure: Three-Phase Abstraction

Users Online Data Infra

Near-Line Infra

Application Offline Data Infra

Infrastructure Latency & Freshness Requirements Products

Online Activity that should be reflected immediately• Member Profiles• Company Profiles• Connections

• Messages • Endorsements• Skills

Near-Line Activity that should be reflected soon

• Activity Streams• Profile Standardization• News

• Recommendations• Search• Messages

Offline Activity that can be reflected later

• People You May Know• Connection Strength• News

• Recommendations• Next best idea…

9

LinkedIn Data Infrastructure: Sample Stack

Infra challenges in 3-phase ecosystem are diverse, complex and specific

Some off-the-shelf.Significant investment in home-grown, deep and

interesting platforms

Databus

10

Streaming Transactions

Databus : Timeline-Consistent Change Data Capture

LinkedIn Data Infrastructure Solutions

Databus at LinkedIn

12

DB

Bootstrap

CaptureChanges

On-lineChanges

On-lineChanges

DB

Compressed

Delta Since T

Consistent

Snapshot at U

Transport independent of data source: Oracle, MySQL, …

Transactional semantics In order, at least once delivery

Tens of relays Hundreds of sources Low latency - milliseconds

Consumer 1

Consumer n

Client

Dat

abus

C

lient

Lib

Consumer 1

Consumer n

Dat

abus

C

lient

Lib

Client

Relay

Event Win

13

Scaling Core Databases

RO

RO

RO

14

Voldemort: Highly-Available Distributed KV Store

LinkedIn Data Infrastructure Solutions

• Pluggable components• Tunable consistency /

availability• Key/value model,

server side “views”

• 10 clusters, 100+ nodes• Largest cluster – 10K+ qps• Avg latency: 3ms• Hundreds of Stores• Largest store – 2.8TB+

Voldemort: Architecture

16

Streaming Non-transactional Events

Offline

Nearline

Processing

17

Kafka: High-Volume Low-Latency Messaging System

LinkedIn Data Infrastructure Solutions

Kafka Architecture

Producer

Consumer

Producer

Consumer

Zookeeper

topic1-part1

topic2-part2

topic2-part1

topic1-part2

topic2-part2

topic2-part1

topic1-part1 topic1-part2

topic1-part1 topic1-part2

topic2-part2

topic2-part1

Broker 1 Broker 2 Broker 3 Broker 4

Key features• Scale-out architecture• High throughput• Automatic load balancing• Intra-cluster replication

Per day stats• writes: 10+ billion messages• reads: 50+ billion messages

19

Filling in the Data Store Gap

Text Search

20

Espresso: Indexed Timeline-Consistent Distributed Data Store

LinkedIn Data Infrastructure Solutions

Application View

21

Hierarchical data model

Rich functionality on resources Conditional updates Partial updates Atomic counters

Rich functionality withinresource groups

Transactions Secondary index Text search

22

Espresso: System Components

• Partitioning/replication• Timeline consistency• Change propagation

23

Generic Cluster Manager: Helix

• Generic Distributed State Model• Config Management• Automatic Load Balancing• Fault tolerance• Cluster expansion and rebalancing

• Espresso, Databus and Search• Open Source Apr 2012• https://github.com/linkedin/helix

Infrastructure challenges in large-scale data mining

Putting it together

Top complaints from data scientists

1 Getting the data in (Ingress ETL)

2 Getting the data out (Egress)

3 Workflow management

4 Model of computation

5 …

Top complaints from data scientists

1 Getting the data in (Ingress ETL)

2 Getting the data out (Egress)

3 Workflow management

4 Model of computation

5 …

LinkedIn Confidential ©2013 All Rights Reserved 27

LinkedIn circa 2010

O(n2) data integration complexity

Infrastructure fragility

• Can’t get all data• Hard to operate• Multi-hour delay• Labor intensive• Slow• Does it work?

Process fragility

• Labor intensive• One man’s

cleaning…

Data model

{ tracking_code=null, session_id=42, tracking_time=Tue Jul 31 07:27:25 PDT 2010, error_key=null, locale=en_us, browser_id=ddc61a81-5311-4859-be42-ca7dc7b941e3, member_id=1213, page_key=profile, tracking_info=Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|, error_id=null, page_type=FULL_PAGE, request_path=view ...}

Data model (cont’d)

{ article_id=5560874437395353942, title=Five Good Reasons to Hire the Unemployed, language=en_US, article_source=bit.ly,url=aHR0cDovL3d3dy5vbmV0aGluZ25ldy5jb20vaW5kZXgucGhwL3dvcmsvMTAyLWZpdmUtZ29vZC1yZWFzb25zLXRvLWhpcmUtdGhlLXVuZW1wbG95ZWQK,...}

Problems

1 Data integration across systems

2 Fragile infrastructure

3 Lack of proper data models (ad-hoc)

LinkedIn Confidential ©2013 All Rights Reserved 34

LinkedIn 2013

O(n) data integration

Publish/subscribe commit log

Data model

Hundreds of message types Thousands of fields What do they all mean? What happens when they change?

Data model

1 Education

2 Push data cleanliness upstream

3 O(1) ETL

4 Evidence-based correctness

Data model

DDL for data definition and schema Central versioned registry of all schemas Schema review Programmatic compatibility model

– Schema changes handled transparently

Workflow

1 Check in schema

2 Code review

3 Ship

Seamless data load into downstream systems

Audit trail

Result: complete, verified copy of all data available

Top complaints from data scientists

1 Getting the data in (Ingress ETL)

2 Getting the data out (Egress)

3 Workflow management

4 Model of computation

5 …

Egress

store DATA into ‘kafka://…’ using Stream();

Top complaints from data scientists

1 Getting the data in (Ingress ETL)

2 Getting the data out (Egress)

3 Workflow management

4 Model of computation

5 …

46

Workflows

Job A

Job B

Job C

47

Workflows

Job A

Job B

Job C

Push to Production

48

Workflows

Job A

Job B

Job C

Push to Production

Job X

49

Workflows

Job A

Job B

Job C

Push to Production

Job X

Push to QA

50

Real workflows are complicated

51

Workflow management: Azkaban

Dependency management Diverse job types (Pig, Hive, Java, . . . ) Scheduling Monitoring Configuration Retry/restart on failure Resource locking Log collection Historical information

52

Workflow management: Azkaban

53

Workflow management: Azkaban

Top complaints from data scientists

1 Getting the data in (Ingress ETL)

2 Getting the data out (Egress)

3 Workflow management

4 Model of computation

5 …

Model of computation

• Alternating Direction Method of Multipliers (ADMM)• Distributed Conjugate Gradient Descent (DCGD)• Distributed L-BFGS• Bayesian Distributed Learning (BDL)

Graphs

Distributed learning

Near-line processing

LinkedIn Confidential ©2013 All Rights Reserved 56

LinkedIn Data Infrastructure: A few take-aways

1. Building infrastructure in a hyper-growth environment is challenging.

2. Few vs Many: Balance over-specialized (agile) vs generic efforts (leverage-able) platforms (*)

3. Balance open-source products with home-grown platforms (**)

4. Data Model and Integration e2e are key (*)

57

Learning more

data.linkedin.com

Recommended