Data Infrastructure at LinkedIn

Jun Rao and Sam Shah

Outline

1. LinkedIn introduction

2. Online/nearline infrastructure overview

3. Infrastructure for data mining

4. Conclusion

The World’s Largest Professional Network

200M+ Members Worldwide

2 new Members Per Second

100M+ Monthly Unique Visitors

2M+ Company Pages

Connecting Talent and Opportunity. At scale…

Member Profiles

Large dataset

Medium writes

Very high reads

Freshness <1s

People You May Know

Large dataset

Compute intensive

High reads

Freshness ~hrs

LinkedIn Today

Moving dataset

High writes

High reads

Freshness ~mins

The Big-Data Feedback Loop

[Figure: feedback loop connecting Member, Product, Data, Science, Analytics and Infrastructure; labels: Insights, Engagement, Virality, Signals, Refinement]

LinkedIn Data Infrastructure: Three-Phase Abstraction

[Figure: Users and Application served by three tiers: Online Data Infra, Near-Line Infra, Offline Data Infra]

Infrastructure: Online
Latency & Freshness Requirements: Activity that should be reflected immediately
Products:
• Member Profiles
• Company Profiles
• Connections
• Messages
• Endorsements
• Skills

Infrastructure: Near-Line
Latency & Freshness Requirements: Activity that should be reflected soon
Products:
• Activity Streams
• Profile Standardization
• News
• Recommendations
• Search
• Messages

Infrastructure: Offline
Latency & Freshness Requirements: Activity that can be reflected later
Products:
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea…
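As a rough illustration of the three-phase split, the sketch below routes a use case to a tier by its freshness budget. The cutoffs (1 second, 1 hour) and the name `pick_tier` are assumptions chosen to line up with the example freshness figures earlier in the deck; they are not LinkedIn's actual rules.

```python
def pick_tier(freshness_seconds: float) -> str:
    """Map a freshness budget to a tier. Cutoffs are illustrative assumptions."""
    if freshness_seconds <= 1:
        return "online"
    if freshness_seconds <= 3600:
        return "near-line"
    return "offline"


# Budgets below are rough readings of "<1s", "~mins" and "~hrs" (assumed values).
examples = {
    "Member Profiles": 1,
    "LinkedIn Today": 5 * 60,
    "People You May Know": 6 * 3600,
}
tiers = {name: pick_tier(budget) for name, budget in examples.items()}
```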

LinkedIn Data Infrastructure: Sample Stack

Infra challenges in 3-phase ecosystem are diverse, complex and specific

Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms

Databus

Streaming Transactions

Databus: Timeline-Consistent Change Data Capture

LinkedIn Data Infrastructure Solutions

Databus at LinkedIn

[Figure: Databus architecture; labels: Capture Changes, On-line Changes, Bootstrap (Compressed Delta Since T, Consistent Snapshot at U), Event Win, Client with Consumer 1 … Consumer n]

• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics: in-order, at-least-once delivery
• Tens of relays
• Hundreds of sources
• Low latency: milliseconds
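In-order, at-least-once delivery means a consumer can see the same change twice (for example after a reconnect), so it has to apply changes idempotently. A minimal sketch, assuming each change event carries a unique, monotonically increasing sequence number (called `scn` here); the class, the event shape and the field names are hypothetical and are not the Databus client API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple

Event = Tuple[int, str, str]  # (scn, key, value); this shape is an assumption


@dataclass
class IdempotentConsumer:
    """Applies an in-order, at-least-once change stream without double-applying."""
    checkpoint: int = 0  # highest scn already applied
    state: Dict[str, str] = field(default_factory=dict)

    def on_events(self, events: Iterable[Event]) -> None:
        for scn, key, value in events:
            if scn <= self.checkpoint:
                continue  # duplicate from a redelivery; skip it
            self.state[key] = value
            self.checkpoint = scn  # advance only after the change is applied


consumer = IdempotentConsumer()
consumer.on_events([(1, "member:1213", "v1"), (2, "member:1214", "v1")])
# Simulated redelivery after a reconnect: scn 2 arrives again, followed by scn 3.
consumer.on_events([(2, "member:1214", "v1"), (3, "member:1213", "v2")])
```

Persisting `checkpoint` together with `state` is what would let a restarted consumer resume from the right point; that part is left out of the sketch.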

Scaling Core Databases

Voldemort: Highly-Available Distributed KV Store

• Pluggable components
• Tunable consistency / availability
• Key/value model, server-side “views”

• 10 clusters, 100+ nodes
• Largest cluster: 10K+ qps
• Avg latency: 3ms
• Hundreds of stores
• Largest store: 2.8TB+
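In Dynamo-style stores, "tunable consistency / availability" is commonly expressed through three per-store numbers: replication factor N, required reads R and required writes W. The sketch below shows the usual rule of thumb that a read is guaranteed to overlap the latest successful write when R + W > N. The function names are hypothetical and the slides do not give the actual settings used.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum shares at least one replica with every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> dict:
    """Replicas that may be down while reads / writes can still reach their quorum."""
    return {"reads": n - r, "writes": n - w}


strong = quorums_overlap(3, 2, 2)   # overlapping quorums, more consistent
fast = quorums_overlap(3, 1, 1)     # no overlap, favors availability and latency
```

Lowering R or W trades consistency for availability and latency; raising them does the opposite.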

Voldemort: Architecture

Streaming Non-transactional Events

[Figure: event stream feeding Offline and Nearline Processing]

Kafka: High-Volume Low-Latency Messaging System

Kafka Architecture

[Figure: Producers publish to, and Consumers read from, Broker 1 to Broker 4, coordinated through Zookeeper; each broker hosts partitions of topic1 and topic2 (e.g. topic1-part1, topic2-part2)]

Key features
• Scale-out architecture
• High throughput
• Automatic load balancing
• Intra-cluster replication

Per day stats
• writes: 10+ billion messages
• reads: 50+ billion messages
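The partitioned-topic layout and the load balancing across consumers can be illustrated with a small sketch. It assumes a CRC32 key hash for choosing a partition and a simple round-robin split of partitions across the consumers in a group; both are stand-ins for illustration and not Kafka's actual partitioner or rebalance algorithm.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Stable key-to-partition mapping (CRC32 is an assumed hash choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions round-robin over the sorted consumer list."""
    ordered = sorted(consumers)
    result: Dict[str, List[int]] = {c: [] for c in ordered}
    for i, partition in enumerate(sorted(partitions)):
        result[ordered[i % len(ordered)]].append(partition)
    return result


two_consumers = assign(list(range(4)), ["c1", "c2"])
# A third consumer joins the group: the same partitions are redistributed.
three_consumers = assign(list(range(4)), ["c1", "c2", "c3"])
```

Because every partition is owned by exactly one consumer in the group, adding consumers (up to the partition count) spreads the read load without duplicating messages.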

Filling in the Data Store Gap

Text Search

Espresso: Indexed Timeline-Consistent Distributed Data Store

Application View

Hierarchical data model

Rich functionality on resources
• Conditional updates
• Partial updates
• Atomic counters

Rich functionality within resource groups
• Transactions
• Secondary index
• Text search
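To make the conditional and partial updates listed above concrete, here is a toy in-memory document store keyed by a hierarchical path. Everything in it (the class `TinyDocStore`, version-number conditions, tuple paths) is a hypothetical illustration and not Espresso's API or storage format.

```python
from typing import Dict, List, Optional, Tuple

Path = Tuple[str, ...]  # e.g. (table, resource group, resource id); assumed shape


class TinyDocStore:
    def __init__(self) -> None:
        self._docs: Dict[Path, Tuple[int, dict]] = {}  # path -> (version, document)

    def get(self, path: Path) -> Optional[Tuple[int, dict]]:
        return self._docs.get(path)

    def put(self, path: Path, doc: dict, if_version: Optional[int] = None) -> bool:
        """Conditional update: write only if the stored version matches if_version."""
        current = self._docs.get(path)
        current_version = current[0] if current else 0
        if if_version is not None and if_version != current_version:
            return False  # someone else updated the resource first
        self._docs[path] = (current_version + 1, dict(doc))
        return True

    def patch(self, path: Path, fields: dict) -> bool:
        """Partial update: merge the given fields into an existing document."""
        current = self._docs.get(path)
        if current is None:
            return False
        version, doc = current
        self._docs[path] = (version + 1, {**doc, **fields})
        return True

    def children(self, prefix: Path) -> List[Path]:
        """All resources under a path prefix, i.e. one resource group."""
        return sorted(p for p in self._docs if p[: len(prefix)] == prefix)


store = TinyDocStore()
msg = ("Mailbox", "bob", "1")
created = store.put(msg, {"subject": "hi", "read": False})
stale = store.put(msg, {"subject": "overwritten"}, if_version=0)  # version is now 1
patched = store.patch(msg, {"read": True})
```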

Espresso: System Components

• Partitioning/replication
• Timeline consistency
• Change propagation

Generic Cluster Manager: Helix

• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing

• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix

Infrastructure challenges in large-scale data mining

Putting it together

Top complaints from data scientists

1 Getting the data in (Ingress ETL)

2 Getting the data out (Egress)

3 Workflow management

4 Model of computation

LinkedIn circa 2010

O(n^2) data integration complexity

Infrastructure fragility

• Can’t get all data
• Hard to operate
• Multi-hour delay
• Labor intensive
• Slow
• Does it work?

Process fragility

• Labor intensive
• One man’s cleaning…

Data model

{ tracking_code=null, session_id=42, tracking_time=Tue Jul 31 07:27:25 PDT 2010, error_key=null, locale=en_us, browser_id=ddc61a81-5311-4859-be42-ca7dc7b941e3, member_id=1213, page_key=profile, tracking_info=Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|, error_id=null, page_type=FULL_PAGE, request_path=view ...}

Data model (cont’d)

{ article_id=5560874437395353942, title=Five Good Reasons to Hire the Unemployed, language=en_US, article_source=bit.ly,url=aHR0cDovL3d3dy5vbmV0aGluZ25ldy5jb20vaW5kZXgucGhwL3dvcmsvMTAyLWZpdmUtZ29vZC1yZWFzb25zLXRvLWhpcmUtdGhlLXVuZW1wbG95ZWQK,...}

Problems

1 Data integration across systems

2 Fragile infrastructure

3 Lack of proper data models (ad-hoc)

LinkedIn 2013

O(n) data integration

Publish/subscribe commit log
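The drop from O(n^2) to O(n) can be checked with simple counting. The sketch assumes n systems that each produce data and consume data from every other system; the function names are made up for illustration.

```python
def point_to_point_pipelines(n: int) -> int:
    """Every system feeds every other system directly: n * (n - 1) pipelines."""
    return n * (n - 1)


def central_log_pipelines(n: int) -> int:
    """Every system publishes once to, and subscribes once from, a shared log: 2n."""
    return 2 * n


before = point_to_point_pipelines(10)
after = central_log_pipelines(10)
```

With 10 systems this is 90 direct pipelines versus 20 connections to the log, and each new system adds 2 connections instead of 2n.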

Data model

Hundreds of message types

Thousands of fields

What do they all mean?

What happens when they change?

Data model

1 Education

2 Push data cleanliness upstream

3 O(1) ETL

4 Evidence-based correctness

Data model

• DDL for data definition and schema
• Central versioned registry of all schemas
• Schema review
• Programmatic compatibility model
  – Schema changes handled transparently
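A programmatic compatibility model can be as simple as a check that runs when a schema is checked in. The sketch below uses one assumed rule: a new version may add fields only if they have defaults, and may not change the type of an existing field. The slides do not state the actual rules or schema format, so `Field`, `compatibility_errors` and the field names are illustrative.

```python
from typing import Dict, List, NamedTuple


class Field(NamedTuple):
    type: str
    has_default: bool = False


Schema = Dict[str, Field]


def compatibility_errors(old: Schema, new: Schema) -> List[str]:
    """Reasons why data written with `old` could not be read with `new`."""
    errors = []
    for name, new_field in new.items():
        if name not in old:
            if not new_field.has_default:
                errors.append(f"new field '{name}' needs a default")
        elif old[name].type != new_field.type:
            errors.append(
                f"field '{name}' changed type {old[name].type} -> {new_field.type}"
            )
    return errors


v1 = {"member_id": Field("long"), "page_key": Field("string")}
v2 = {**v1, "locale": Field("string", has_default=True)}  # safe addition
v3 = {**v1, "member_id": Field("string")}                  # breaking change

ok = compatibility_errors(v1, v2)
broken = compatibility_errors(v1, v3)
```

A registry would run such a check against the latest registered version and reject the check-in when the list is non-empty.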

Workflow

1 Check in schema

2 Code review

3 Ship

Seamless data load into downstream systems

Audit trail

Result: complete, verified copy of all data available

Egress

store DATA into 'kafka://…' using Stream();

Workflows

Push to Production

Push to QA

Real workflows are complicated

Workflow management: Azkaban

• Dependency management
• Diverse job types (Pig, Hive, Java, …)
• Scheduling
• Monitoring
• Configuration
• Retry/restart on failure
• Resource locking
• Log collection
• Historical information
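Dependency management and retry on failure, two of the features listed above, can be sketched with a topological sort over a job graph. This is not Azkaban's implementation or job format; the job names, `run_flow` and the retry policy are assumptions, and jobs run sequentially for simplicity.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set, Tuple


def run_flow(
    deps: Dict[str, Set[str]],
    jobs: Dict[str, Callable[[], None]],
    max_attempts: int = 2,
) -> List[Tuple[str, int, str]]:
    """Run jobs in dependency order, retrying each up to max_attempts times."""
    log = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                jobs[name]()
                log.append((name, attempt, "ok"))
                break
            except Exception:
                log.append((name, attempt, "failed"))
        else:
            raise RuntimeError(f"job {name} failed after {max_attempts} attempts")
    return log


calls = {"load": 0}


def flaky_load() -> None:
    calls["load"] += 1
    if calls["load"] == 1:
        raise IOError("transient failure")  # succeeds on the second attempt


# Each key depends on the jobs in its set: extract -> load -> train -> push.
deps = {"load": {"extract"}, "train": {"load"}, "push": {"train"}}
jobs = {"extract": lambda: None, "load": flaky_load,
        "train": lambda: None, "push": lambda: None}
log = run_flow(deps, jobs)
```

The returned log doubles as a small piece of the "historical information" a workflow manager keeps per run.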

Model of computation

• Alternating Direction Method of Multipliers (ADMM)
• Distributed Conjugate Gradient Descent (DCGD)
• Distributed L-BFGS
• Bayesian Distributed Learning (BDL)
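To give a feel for the first method in the list, here is a toy consensus ADMM in scaled form. Each "worker" i holds a private value a_i and a local objective 0.5 * (x - a_i)^2; the workers have to agree on a single x, whose optimum is the mean of the a_i. The problem, the penalty `rho` and the iteration count are assumptions picked for illustration and say nothing about how LinkedIn applies ADMM.

```python
from statistics import mean
from typing import List


def consensus_admm(local_targets: List[float], rho: float = 1.0,
                   iterations: int = 60) -> float:
    """Minimize sum_i 0.5 * (x - a_i)^2 via consensus ADMM; returns the shared z."""
    n = len(local_targets)
    x = [0.0] * n   # local copies, one per worker
    u = [0.0] * n   # scaled dual variables, one per worker
    z = 0.0         # global consensus variable
    for _ in range(iterations):
        # Local step (parallel across workers): closed-form minimizer of
        # 0.5 * (x - a_i)^2 + (rho / 2) * (x - z + u_i)^2
        x = [(a + rho * (z - ui)) / (1 + rho) for a, ui in zip(local_targets, u)]
        # Global step: average the local estimates plus duals
        z = mean(xi + ui for xi, ui in zip(x, u))
        # Dual step: accumulate each worker's disagreement with the consensus
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z


estimate = consensus_admm([1.0, 2.0, 6.0])
```

Only the local step touches a worker's private data; the global step needs just one number per worker, which is what makes the pattern a fit for distributed learning.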

Graphs

Distributed learning

Near-line processing

LinkedIn Data Infrastructure: A few take-aways

1. Building infrastructure in a hyper-growth environment is challenging.

2. Few vs. many: balance specialized (agile) efforts against generic (leverageable) platforms (*)

3. Balance open-source products with home-grown platforms (**)

4. End-to-end data model and integration are key (*)

Learning more

data.linkedin.com
