This talk was given by Jun Rao (Staff Software Engineer at LinkedIn) and Sam Shah (Senior Engineering Manager at LinkedIn) at the Analytics@Webscale Technical Conference (June 2013).
LinkedIn Confidential ©2013 All Rights Reserved
Data Infrastructure at LinkedIn
Jun Rao and Sam Shah
Outline
1. LinkedIn introduction
2. Online/nearline infrastructure overview
3. Infrastructure for data mining
4. Conclusion
The World’s Largest Professional Network
200M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
2M+ Company Pages
Connecting Talent with Opportunity. At scale…
Member Profiles
Large dataset
Medium writes
Very high reads
Freshness <1s
People You May Know
Large dataset
Compute intensive
High reads
Freshness ~hrs
LinkedIn Today
Moving dataset
High writes
High reads
Freshness ~mins
The Big-Data Feedback Loop
[Diagram labels: Member, Product, Data, Science, Engagement, Virality, Signals, Refinement, Value, Insights, Scale, Infrastructure, Analytics]
LinkedIn Data Infrastructure: Three-Phase Abstraction
[Diagram labels: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra]
Infrastructure | Latency & Freshness Requirements | Products
Online | Activity that should be reflected immediately | Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills
Near-Line | Activity that should be reflected soon | Activity Streams, Profile Standardization, News, Recommendations, Search, Messages
Offline | Activity that can be reflected later | People You May Know, Connection Strength, News, Recommendations, Next best idea…
LinkedIn Data Infrastructure: Sample Stack
Infra challenges in 3-phase ecosystem are diverse, complex and specific
Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms
Databus
Streaming Transactions
Databus: Timeline-Consistent Change Data Capture
LinkedIn Data Infrastructure Solutions
Databus at LinkedIn
[Diagram: DB, Relay (Event Win), Bootstrap (with its own DB), and two Clients, each with a Databus Client Lib feeding Consumer 1 … Consumer n. Edge labels: Capture Changes, On-line Changes, Compressed Delta Since T, Consistent Snapshot at U.]
• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics: in-order, at-least-once delivery
• Tens of relays
• Hundreds of sources
• Low latency: milliseconds
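In-order, at-least-once delivery means a consumer may see the same change twice after a retry, so consumers usually need to be idempotent. A minimal sketch of that idea, assuming each change carries a strictly increasing sequence number (the `scn` field, the class names and the sample keys are illustrative, not the Databus client API):

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    scn: int      # hypothetical sequence number, increasing in commit order
    key: str
    value: str


@dataclass
class IdempotentConsumer:
    """Applies change events in order and skips redelivered ones."""
    checkpoint: int = 0                        # highest sequence number already applied
    state: dict = field(default_factory=dict)

    def on_event(self, event: ChangeEvent) -> bool:
        # At-least-once delivery can replay an event after a retry;
        # anything at or below the checkpoint has already been applied.
        if event.scn <= self.checkpoint:
            return False
        self.state[event.key] = event.value
        self.checkpoint = event.scn
        return True


consumer = IdempotentConsumer()
stream = [
    ChangeEvent(1, "member:1213", "headline=Engineer"),
    ChangeEvent(2, "member:1214", "headline=Manager"),
    ChangeEvent(2, "member:1214", "headline=Manager"),   # redelivery after a retry
    ChangeEvent(3, "member:1213", "headline=Staff Engineer"),
]
applied = [consumer.on_event(e) for e in stream]
```

The sketch assumes one event per sequence number; a system that groups several events into one transaction would checkpoint at transaction boundaries instead.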
Scaling Core Databases
[Diagram: three replicas labeled RO]
Voldemort: Highly-Available Distributed KV Store
• Pluggable components
• Tunable consistency / availability
• Key/value model, server side “views”
• 10 clusters, 100+ nodes
• Largest cluster: 10K+ qps
• Avg latency: 3ms
• Hundreds of Stores
• Largest store: 2.8TB+
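Tunable consistency / availability in replicated key/value stores is commonly expressed with three numbers: N replicas per key, R replicas that must answer a read, and W replicas that must acknowledge a write. The slides do not give Voldemort's actual settings, so the function names and configurations below are illustrative assumptions:

```python
def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """True when every read quorum intersects every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> dict:
    """Replica failures each operation type can survive and still reach its quorum."""
    return {"reads": n - r, "writes": n - w}


# Hypothetical store configurations as (N, R, W).
configs = {
    "favor_consistency": (3, 2, 2),
    "favor_availability": (3, 1, 1),
}
report = {
    name: (quorum_overlaps(*cfg), tolerated_failures(*cfg))
    for name, cfg in configs.items()
}
```

With R + W > N a read always reaches at least one replica that holds the latest acknowledged write; lowering R or W trades that guarantee for the ability to keep serving with more replicas down.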
Voldemort: Architecture
Streaming Non-transactional Events
[Diagram labels: Offline, Nearline, Processing]
Kafka: High-Volume Low-Latency Messaging System
Kafka Architecture
[Diagram: Producers, Consumers and Zookeeper around Broker 1 to Broker 4; each broker hosts a subset of the partitions topic1-part1, topic1-part2, topic2-part1 and topic2-part2]
Key features
• Scale-out architecture
• High throughput
• Automatic load balancing
• Intra-cluster replication
Per day stats
• writes: 10+ billion messages
• reads: 50+ billion messages
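The topic/partition layout in the architecture slide can be sketched as a set of append-only lists, with producers choosing a partition by key and consumers tracking their own read offsets. This is a toy in-memory model under those assumptions, not the Kafka client API; all class and method names are hypothetical:

```python
import zlib
from collections import defaultdict


class PartitionedLog:
    """Toy in-memory stand-in for a topic split into append-only partitions."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: str, message: str) -> tuple:
        # A stable hash keeps every message for one key in one partition,
        # so per-key ordering is preserved.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def read(self, partition: int, offset: int, max_messages: int = 10) -> list:
        return self.partitions[partition][offset:offset + max_messages]


class Consumer:
    """Tracks its own read position per partition; the log keeps no per-consumer state."""

    def __init__(self, log: PartitionedLog):
        self.log = log
        self.offsets = defaultdict(int)

    def poll(self, partition: int) -> list:
        batch = self.log.read(partition, self.offsets[partition])
        self.offsets[partition] += len(batch)
        return batch


log = PartitionedLog(num_partitions=2)
p, first = log.append("member:1213", "page_view")
_, second = log.append("member:1213", "profile_edit")
consumer = Consumer(log)
batch = consumer.poll(p)
```

Adding partitions is what lets both the write path and the read path scale out: each partition can live on a different broker and be read by a different consumer.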
Filling in the Data Store Gap
Text Search
Espresso: Indexed Timeline-Consistent Distributed Data Store
Application View
Hierarchical data model
Rich functionality on resources
• Conditional updates
• Partial updates
• Atomic counters
Rich functionality within resource groups
• Transactions
• Secondary index
• Text search
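The three per-resource operations named above (conditional updates, partial updates, atomic counters) can be illustrated with a small in-memory store. This is a sketch of the semantics only; the class, its methods and the example path are hypothetical and do not reflect Espresso's actual interface:

```python
import threading


class ResourceStore:
    """In-memory stand-in for a document store keyed by a hierarchical path."""

    def __init__(self):
        self._docs = {}                  # path -> (version, document dict)
        self._lock = threading.Lock()

    def get(self, path: str) -> tuple:
        return self._docs.get(path, (0, {}))

    def put_if_version(self, path: str, expected_version: int, doc: dict) -> bool:
        """Conditional update: write only if the document is unchanged since it was read."""
        with self._lock:
            version, _ = self._docs.get(path, (0, {}))
            if version != expected_version:
                return False
            self._docs[path] = (version + 1, dict(doc))
            return True

    def patch(self, path: str, fields: dict) -> int:
        """Partial update: merge the given fields into the stored document."""
        with self._lock:
            version, doc = self._docs.get(path, (0, {}))
            self._docs[path] = (version + 1, {**doc, **fields})
            return version + 1

    def increment(self, path: str, field: str, delta: int = 1) -> int:
        """Atomic counter: read-modify-write under the lock."""
        with self._lock:
            version, doc = self._docs.get(path, (0, {}))
            value = doc.get(field, 0) + delta
            self._docs[path] = (version + 1, {**doc, field: value})
            return value


store = ResourceStore()
path = "/MailboxDB/Messages/1213/42"   # hypothetical database/table/group/resource path
created = store.put_if_version(path, 0, {"subject": "Hello", "read": False})
stale = store.put_if_version(path, 0, {"subject": "Overwritten"})   # stale version
store.patch(path, {"read": True})
views = store.increment(path, "views")
```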
Espresso: System Components
• Partitioning/replication
• Timeline consistency
• Change propagation
Generic Cluster Manager: Helix
• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing
• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
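A generic distributed state model can be read as a small state machine per replica plus a controller that moves replicas between states when nodes fail. The sketch below assumes a three-state master/slave model; the state names, transition table and functions are illustrative and are not Helix's API:

```python
from collections import deque

# Hypothetical replica state model: which state changes are legal.
TRANSITIONS = {
    "OFFLINE": ["SLAVE"],
    "SLAVE": ["MASTER", "OFFLINE"],
    "MASTER": ["SLAVE"],
}


def transition_path(current: str, target: str) -> list:
    """Shortest sequence of legal states leading from current to target (BFS)."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in TRANSITIONS[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    raise ValueError(f"no legal path from {current} to {target}")


def failover(assignment: dict, failed_node: str) -> dict:
    """Mark the failed node offline and promote one surviving slave if the master was lost."""
    updated = {node: state for node, state in assignment.items() if node != failed_node}
    updated[failed_node] = "OFFLINE"
    if "MASTER" not in updated.values():
        for node in sorted(updated):
            if updated[node] == "SLAVE":
                updated[node] = "MASTER"
                break
    return updated


path = transition_path("OFFLINE", "MASTER")
after = failover({"node1": "MASTER", "node2": "SLAVE", "node3": "SLAVE"}, "node1")
```

Keeping the state model declarative is what makes the manager generic: a different system only needs to supply its own states and legal transitions.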
Infrastructure challenges in large-scale data mining
Putting it together
Top complaints from data scientists
1 Getting the data in (Ingress ETL)
2 Getting the data out (Egress)
3 Workflow management
4 Model of computation
5 …
LinkedIn circa 2010
O(n²) data integration complexity
Infrastructure fragility
• Can’t get all data
• Hard to operate
• Multi-hour delay
• Labor intensive
• Slow
• Does it work?
Process fragility
• Labor intensive
• One man’s cleaning…
Data model
{ tracking_code=null, session_id=42, tracking_time=Tue Jul 31 07:27:25 PDT 2010, error_key=null, locale=en_us, browser_id=ddc61a81-5311-4859-be42-ca7dc7b941e3, member_id=1213, page_key=profile, tracking_info=Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|, error_id=null, page_type=FULL_PAGE, request_path=view ...}
Data model (cont’d)
{ article_id=5560874437395353942, title=Five Good Reasons to Hire the Unemployed, language=en_US, article_source=bit.ly,url=aHR0cDovL3d3dy5vbmV0aGluZ25ldy5jb20vaW5kZXgucGhwL3dvcmsvMTAyLWZpdmUtZ29vZC1yZWFzb25zLXRvLWhpcmUtdGhlLXVuZW1wbG95ZWQK,...}
Problems
1 Data integration across systems
2 Fragile infrastructure
3 Lack of proper data models (ad-hoc)
LinkedIn 2013
O(n) data integration
Publish/subscribe commit log
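The difference between the 2010 and 2013 pictures is a counting argument: with point-to-point feeds every system may need a pipeline to every other system, while with a shared log each system needs at most one publish and one subscribe connection. A small worked example (the function names and system counts are illustrative):

```python
def point_to_point_pipelines(systems: int) -> int:
    """Every system feeds every other system directly: n * (n - 1) one-way pipelines."""
    return systems * (systems - 1)


def central_log_pipelines(systems: int) -> int:
    """Every system publishes to and subscribes from one shared log: at most 2n connections."""
    return 2 * systems


comparison = {
    n: (point_to_point_pipelines(n), central_log_pipelines(n))
    for n in (5, 10, 50)
}
```

For 10 systems this is 90 pipelines versus 20 connections; for 50 systems it is 2450 versus 100.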
Data model
Hundreds of message types
Thousands of fields
What do they all mean?
What happens when they change?
Data model
1 Education
2 Push data cleanliness upstream
3 O(1) ETL
4 Evidence-based correctness
Data model
• DDL for data definition and schema
• Central versioned registry of all schemas
• Schema review
• Programmatic compatibility model
  – Schema changes handled transparently
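A versioned registry with a programmatic compatibility check might look like the sketch below. The compatibility rule used here (new fields need a default, existing fields may not change type) is one common choice and an assumption; the slides do not state the exact rules, and the schema format, class and event name are hypothetical:

```python
def compatibility_errors(old: dict, new: dict) -> list:
    """Reasons why data written with `old` could not be read with `new` (empty if compatible).

    A schema here is {field_name: {"type": str, "default": value (optional)}}.
    """
    errors = []
    for name, spec in new.items():
        if name not in old:
            if "default" not in spec:
                errors.append(f"new field '{name}' has no default")
        elif old[name]["type"] != spec["type"]:
            errors.append(
                f"field '{name}' changed type {old[name]['type']} -> {spec['type']}"
            )
    return errors


class SchemaRegistry:
    """Keeps every accepted version of each schema and rejects incompatible changes."""

    def __init__(self):
        self.versions = {}   # schema name -> list of accepted schemas

    def register(self, name: str, schema: dict) -> int:
        history = self.versions.setdefault(name, [])
        if history:
            errors = compatibility_errors(history[-1], schema)
            if errors:
                raise ValueError("; ".join(errors))
        history.append(schema)
        return len(history)   # version number, starting at 1


registry = SchemaRegistry()
base = {"member_id": {"type": "int"}, "page_key": {"type": "string"}}
v1 = registry.register("PageViewEvent", base)
v2 = registry.register(
    "PageViewEvent", {**base, "locale": {"type": "string", "default": "en_US"}}
)
try:
    registry.register("PageViewEvent", {**base, "member_id": {"type": "string"}})
    rejected = False
except ValueError:
    rejected = True
```

Running the check at check-in time is what pushes data cleanliness upstream: an incompatible change fails review before any downstream load sees it.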
Workflow
1 Check in schema
2 Code review
3 Ship
Seamless data load into downstream systems
Audit trail
Result: complete, verified copy of all data available
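One way to back a "complete, verified copy" claim with evidence is to compare per-window event counts reported at the source with the counts that arrived downstream. The slides only name the audit trail, so the windowing scheme, function names and sample timestamps below are assumptions:

```python
from collections import Counter


def window_counts(event_times: list, window_seconds: int = 600) -> Counter:
    """Number of events per fixed time window, keyed by window start (epoch seconds)."""
    return Counter(t - t % window_seconds for t in event_times)


def audit(produced: Counter, loaded: Counter) -> dict:
    """Windows where the downstream copy disagrees with what producers reported."""
    windows = set(produced) | set(loaded)
    return {
        w: {"produced": produced[w], "loaded": loaded[w]}
        for w in sorted(windows)
        if produced[w] != loaded[w]
    }


sent = [0, 30, 590, 600, 610, 1250]
received = [0, 30, 590, 600, 1250]   # one event from the second window is missing
mismatches = audit(window_counts(sent), window_counts(received))
```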
Egress
store DATA into 'kafka://…' using Stream();
Workflows
[Diagram, built up over four slides: Job A, Job B and Job C; then Push to Production; then Job X; then Push to QA]
Real workflows are complicated
Workflow management: Azkaban
• Dependency management
• Diverse job types (Pig, Hive, Java, …)
• Scheduling
• Monitoring
• Configuration
• Retry/restart on failure
• Resource locking
• Log collection
• Historical information
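Dependency management and retry on failure can be sketched with a topological ordering of the job graph. This is a generic illustration using Python's standard library, not Azkaban's job format or API; the job names echo the workflow slides, and the dependencies between them are assumed:

```python
from graphlib import TopologicalSorter   # standard library, Python 3.9+


def run_workflow(dependencies: dict, jobs: dict, max_attempts: int = 2) -> list:
    """Run each job after all of its dependencies, retrying a failed job up to max_attempts times.

    dependencies maps a job name to the set of jobs it depends on;
    jobs maps a job name to a zero-argument callable.
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                jobs[name]()
                completed.append(name)
                break
            except Exception:
                if attempt == max_attempts:
                    raise RuntimeError(f"{name} failed after {max_attempts} attempts")
    return completed


attempts = {"count": 0}


def flaky_job_b():
    # Fails on the first call only, to exercise the retry path.
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise IOError("transient failure")


dependencies = {
    "job_a": set(),
    "job_b": {"job_a"},
    "job_c": {"job_b"},
    "push_to_production": {"job_c"},
}
jobs = {
    "job_a": lambda: None,
    "job_b": flaky_job_b,
    "job_c": lambda: None,
    "push_to_production": lambda: None,
}
order = run_workflow(dependencies, jobs)
```

`TopologicalSorter` raises `CycleError` when the graph contains a cycle, which is the kind of mistake a workflow manager should reject before anything runs.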
Model of computation
• Alternating Direction Method of Multipliers (ADMM)
• Distributed Conjugate Gradient Descent (DCGD)
• Distributed L-BFGS
• Bayesian Distributed Learning (BDL)
Graphs
Distributed learning
Near-line processing
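Of the distributed learning methods listed, ADMM is the easiest to show in a few lines. The sketch below runs consensus ADMM on a deliberately tiny problem, minimizing the sum of 0.5 * (x - a_i)² where each a_i sits on a different worker, so the answer is simply the mean of the a_i. The problem, function name and parameters are illustrative and are not taken from the talk:

```python
def consensus_admm(local_values: list, rho: float = 1.0, iterations: int = 100) -> float:
    """Consensus ADMM for minimizing sum_i 0.5 * (x - a_i)**2 over a shared scalar x.

    Each a_i lives on a different worker. Workers update local copies x_i
    independently; a coordinator averages them into the consensus value z.
    """
    n = len(local_values)
    z = 0.0
    u = [0.0] * n   # scaled dual variables, one per worker
    for _ in range(iterations):
        # Local step (parallel in a real system): closed-form minimizer of
        # 0.5 * (x - a_i)**2 + (rho / 2) * (x - z + u_i)**2
        x = [(a + rho * (z - u_i)) / (1.0 + rho) for a, u_i in zip(local_values, u)]
        # Coordinator step: average the local estimates plus duals.
        z = sum(x_i + u_i for x_i, u_i in zip(x, u)) / n
        # Dual step: accumulate each worker's disagreement with the consensus.
        u = [u_i + x_i - z for x_i, u_i in zip(x, u)]
    return z


estimate = consensus_admm([1.0, 2.0, 6.0])
```

The local step touches only one worker's data and the coordinator step is a single average, which is why this pattern is awkward to express as plain MapReduce jobs but natural for an iterative model of computation.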
LinkedIn Data Infrastructure: A few take-aways
1. Building infrastructure in a hyper-growth environment is challenging.
2. Few vs. many: balance over-specialized (agile) efforts against generic (leverageable) platforms (*)
3. Balance open-source products with home-grown platforms (**)
4. Data Model and Integration e2e are key (*)
Learning more
data.linkedin.com