Data Infrastructure at LinkedIn

Jun Rao and Sam Shah
LinkedIn Confidential ©2013 All Rights Reserved

Outline
1. LinkedIn introduction
2. Online/nearline infrastructure overview
3. Infrastructure for data mining
4. Conclusion

The World's Largest Professional Network
200M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
2M+ Company Pages
Connecting Talent with Opportunity. At scale…

Member Profiles
Large dataset
Medium writes
Very high reads
Freshness <1s

People You May Know
Large dataset
Compute intensive
High reads
Freshness ~hrs

LinkedIn Today
Moving dataset
High writes
High reads
Freshness ~mins

The Big-Data Feedback Loop
[Diagram: feedback loop linking Member, Product, Data, Science, and Infrastructure; labels on the loop: Engagement, Virality, Signals, Refinement, Analytics, Insights, Scale, Value]

LinkedIn Data Infrastructure: Three-Phase Abstraction
[Diagram: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra]
Infrastructure: latency and freshness requirements, with example products
Online: activity that should be reflected immediately
• Member Profiles
• Company Profiles
• Connections
• Messages
• Endorsements
• Skills
Near-Line: activity that should be reflected soon
• Activity Streams
• Profile Standardization
• News
• Recommendations
• Search
• Messages
Offline: activity that can be reflected later
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea…

LinkedIn Data Infrastructure: Sample Stack
Infra challenges in the 3-phase ecosystem are diverse, complex and specific.
Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms.

LinkedIn Data Infrastructure Solutions
Databus: Timeline-Consistent Change Data Capture

Databus at LinkedIn
• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In-order, at-least-once delivery
• Tens of relays
• Hundreds of sources
• Low latency: milliseconds
[Diagram: source DB → Relay (Capture Changes, Event Win) → On-line Changes → Client (Databus ClientLib → Consumer 1 … Consumer n); Bootstrap provides a Consistent Snapshot at U to clients]
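
In-order, at-least-once delivery means a consumer may see the same change more than once (for example after a reconnect), so it has to apply changes idempotently. The sketch below illustrates that idea only; it is not the Databus client API. `ChangeEvent`, `Consumer`, and the sequence-number field `scn` are names assumed for the example, and it assumes one event per sequence number.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    scn: int              # hypothetical sequence number; increases with commit order
    key: str
    value: Optional[str]  # None models a delete


@dataclass
class Consumer:
    state: Dict[str, str] = field(default_factory=dict)
    last_scn: int = 0     # highest sequence number applied so far

    def bootstrap(self, snapshot: Dict[str, str], snapshot_scn: int) -> None:
        """Start from a consistent snapshot taken at sequence number U."""
        self.state = dict(snapshot)
        self.last_scn = snapshot_scn

    def on_events(self, events: Iterable[ChangeEvent]) -> None:
        """Apply events in order, skipping anything already applied."""
        for ev in events:
            if ev.scn <= self.last_scn:
                continue  # redelivered event (at-least-once delivery)
            if ev.value is None:
                self.state.pop(ev.key, None)
            else:
                self.state[ev.key] = ev.value
            self.last_scn = ev.scn
```

Tracking the last applied sequence number is what lets a client switch from the bootstrap snapshot to the online change stream without missing or double-applying changes.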

Scaling Core Databases
[Diagram: core database scaled out with three RO (read-only) replicas]

Voldemort: Highly-Available Distributed KV Store

Voldemort: Architecture
• Pluggable components
• Tunable consistency / availability
• Key/value model, server-side "views"
• 10 clusters, 100+ nodes
• Largest cluster: 10K+ qps
• Avg latency: 3 ms
• Hundreds of stores
• Largest store: 2.8TB+
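
"Tunable consistency / availability" is commonly expressed through three per-store numbers: the replication factor N, the required reads R, and the required writes W. The sketch below is an illustration of that trade-off, not Voldemort code; the class and method names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreConfig:
    replication_factor: int  # N: replicas per key
    required_reads: int      # R: replicas that must answer a read
    required_writes: int     # W: replicas that must acknowledge a write

    def __post_init__(self) -> None:
        n, r, w = self.replication_factor, self.required_reads, self.required_writes
        if not (1 <= r <= n and 1 <= w <= n):
            raise ValueError("R and W must be between 1 and N")

    def reads_see_latest_write(self) -> bool:
        """True when every read quorum overlaps every write quorum (R + W > N)."""
        return self.required_reads + self.required_writes > self.replication_factor

    def tolerated_failures(self) -> int:
        """Replica failures that still leave both reads and writes available."""
        return self.replication_factor - max(self.required_reads, self.required_writes)
```

With N=3, R=2, W=2 reads overlap the latest write and one replica can be down; with N=3, R=1, W=1 two replicas can be down but a read may return stale data.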

Streaming Non-transactional Events
[Diagram: event streams feeding Offline and Nearline Processing]

Kafka: High-Volume Low-Latency Messaging System

Kafka Architecture
[Diagram: Producers publish to, and Consumers read from, Brokers 1-4; each broker hosts partitions of topic1 and topic2 (topic1-part1, topic1-part2, topic2-part1, topic2-part2); Zookeeper is used for coordination]
Key features
• Scale-out architecture
• High throughput
• Automatic load balancing
• Intra-cluster replication
Per day stats
• writes: 10+ billion messages
• reads: 50+ billion messages
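
The scale-out design rests on splitting each topic into partitions and spreading those partitions over brokers. The sketch below shows one simple way to do both; it is an illustration under assumptions, not Kafka's actual partitioner or placement logic. The CRC32 hash, the round-robin placement, and the function names are choices made for the example, and replication is left out.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition from the key so one key always maps to one partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(topics: Dict[str, int], brokers: List[str]) -> Dict[str, List[str]]:
    """Spread every topic-partition across brokers round-robin."""
    placement: Dict[str, List[str]] = {b: [] for b in brokers}
    i = 0
    for topic in sorted(topics):
        for part in range(1, topics[topic] + 1):
            placement[brokers[i % len(brokers)]].append(f"{topic}-part{part}")
            i += 1
    return placement
```

Because a key always lands in the same partition, messages for that key keep their relative order even though the topic as a whole is spread over several brokers.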

Filling in the Data Store Gap
[Diagram: data store landscape; recoverable label: Text Search]

Espresso: Indexed Timeline-Consistent Distributed Data Store

Application View
Hierarchical data model
Rich functionality on resources
• Conditional updates
• Partial updates
• Atomic counters
Rich functionality within resource groups
• Transactions
• Secondary index
• Text search
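
The per-resource operations listed above can be pictured with a small in-memory stand-in. This is a sketch under assumptions, not the Espresso API: `ResourceStore`, its method names, the integer version used for conditional updates, and the example resource paths are all hypothetical. It is single-threaded, so it shows the semantics only; a real store would have to serialize these operations per resource.

```python
from typing import Any, Dict, Optional, Tuple


class ResourceStore:
    """In-memory stand-in for a document store keyed by hierarchical resource path."""

    def __init__(self) -> None:
        self._docs: Dict[str, Tuple[int, Dict[str, Any]]] = {}  # path -> (version, doc)

    def get(self, path: str) -> Tuple[int, Dict[str, Any]]:
        version, doc = self._docs.get(path, (0, {}))
        return version, dict(doc)

    def put(self, path: str, doc: Dict[str, Any], if_version: Optional[int] = None) -> bool:
        """Conditional update: write only if the stored version matches if_version."""
        version, _ = self._docs.get(path, (0, {}))
        if if_version is not None and if_version != version:
            return False
        self._docs[path] = (version + 1, dict(doc))
        return True

    def patch(self, path: str, fields: Dict[str, Any]) -> None:
        """Partial update: change only the named fields."""
        version, doc = self._docs.get(path, (0, {}))
        self._docs[path] = (version + 1, {**doc, **fields})

    def increment(self, path: str, field: str, delta: int = 1) -> int:
        """Counter update done by the store instead of a client read-modify-write."""
        version, doc = self._docs.get(path, (0, {}))
        new_value = int(doc.get(field, 0)) + delta
        self._docs[path] = (version + 1, {**doc, field: new_value})
        return new_value
```

A client that read version 1 of a document and calls `put(..., if_version=1)` succeeds only if nobody else has written in between, which is the point of a conditional update.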

Espresso: System Components
• Partitioning/replication
• Timeline consistency
• Change propagation

Generic Cluster Manager: Helix
• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing
• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
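
A "generic distributed state model" can be read as: each partition replica is a small state machine, and the cluster manager moves replicas between states along allowed transitions. The sketch below illustrates that with a master/slave style model; it does not use the Helix API, and the transition table and function name are assumptions made for the example.

```python
from collections import deque
from typing import Dict, List

# Allowed transitions of a master/slave style state model (assumed for illustration).
TRANSITIONS: Dict[str, List[str]] = {
    "OFFLINE": ["SLAVE"],
    "SLAVE": ["MASTER", "OFFLINE"],
    "MASTER": ["SLAVE"],
}


def transition_path(current: str, target: str) -> List[str]:
    """Shortest sequence of states leading from current to target (breadth-first search)."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in TRANSITIONS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    raise ValueError(f"no path from {current} to {target}")
```

Promoting an offline replica to master therefore goes through the slave state first, which gives the replica a chance to catch up before it takes writes.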

Infrastructure challenges in large-scale data mining
Putting it together

Top complaints from data scientists
1 Getting the data in (Ingress ETL)
2 Getting the data out (Egress)
3 Workflow management
4 Model of computation
5 …

LinkedIn circa 2010

O(n²) data integration complexity
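
The quadratic growth comes from wiring every system directly to every other system. The arithmetic below contrasts that with a single shared pipeline that each system publishes to and reads from; the shared-pipeline comparison and the function names are assumptions made for the illustration.

```python
def point_to_point_pipelines(n_systems: int) -> int:
    """Every system feeds every other system directly: n * (n - 1) pipelines."""
    return n_systems * (n_systems - 1)


def central_pipeline_connections(n_systems: int) -> int:
    """Every system publishes to and reads from one shared pipeline: 2 * n connections."""
    return 2 * n_systems
```

For 10 systems that is 90 point-to-point pipelines against 20 connections to a shared pipeline, and the gap widens as systems are added.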

Infrastructure fragility
• Can't get all data
• Hard to operate
• Multi-hour delay
• Labor intensive
• Slow
• Does it work?

Process fragility
• Labor intensive
• One man's cleaning…
[Diagram: FE, MT, BE, DT tiers; FE Dev, BE Dev, ETL Team; ETL into DW/Hadoop]

Data model
{
tracking_code=null,
session_id=42,
tracking_time=Tue Jul 31 07:27:25 PDT 2010,
error_key=null,
locale=en_us,
browser_id=ddc61a81-5311-4859-be42-ca7dc7b941e3,
member_id=1213,
page_key=profile,
tracking_info=Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|,
error_id=null,
page_type=FULL_PAGE,
request_path=view
...
}

Data model (cont'd)
{
article_id=5560874437395353942,
title=Five Good Reasons to Hire the Unemployed,
language=en_US,
article_source=bit.ly,
url=aHR0cDovL3d3dy5vbmV0aGluZ25ldy5jb20vaW5kZXgucGhwL3dvcmsvMTAyLWZpdmUtZ29vZC1yZWFzb25zLXRvLWhpcmUtdGhlLXVuZW1wbG95ZWQK,
...
}

Problems
1 Data integration across systems
2 Fragile infrastructure
3 Lack of proper data models (ad-hoc)

LinkedIn 2013

Data model
• Hundreds of message types
• Thousands of fields
• What do they all mean?
• What happens when they change?

Data model
1 Education
2 Push data cleanliness upstream
3 O(1) ETL
4 Evidence-based correctness

Data model
• DDL for data definition and schema
• Central versioned registry of all schemas
• Schema review
• Programmatic compatibility model
  – Schema changes handled transparently
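
A programmatic compatibility model means a schema change can be checked by code before it ships. The sketch below shows one plausible rule set, assumed for the illustration and not taken from the talk: a new schema can read old data if every added field has a default and no existing field changes type. The `Schema` shape and the function name are hypothetical.

```python
from typing import Any, Dict, List

Schema = Dict[str, Dict[str, Any]]  # field name -> {"type": ..., optional "default": ...}


def compatibility_problems(old: Schema, new: Schema) -> List[str]:
    """Reasons why data written with `old` could not be read with `new`."""
    problems = []
    for name, spec in new.items():
        if name not in old:
            # Old records lack this field, so the reader needs a default to fill in.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif old[name]["type"] != spec["type"]:
            problems.append(f"field '{name}' changed type")
    return problems
```

A registry could run a check like this at schema check-in and reject the change when the returned list is not empty.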

Workflow
1 Check in schema
2 Code review
3 Ship
Seamless data load into downstream systems

Result: a complete, verified copy of all data is available

Egress
store DATA into 'kafka://…' using Stream();

Workflows
Job A
Job B
Job C

Workflows
Job A
Job B
Job C
Push to Production

Workflows
Job A
Job B
Job C
Push to Production
Job X

Workflows
Job A
Job B
Job C
Push to Production
Job X
Push to QA

Real workflows are complicated

Workflow management: Azkaban
• Dependency management
• Diverse job types (Pig, Hive, Java, …)
• Scheduling
• Monitoring
• Configuration
• Retry/restart on failure
• Resource locking
• Log collection
• Historical information
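
Dependency management and retry on failure are the core of a workflow runner. The sketch below shows both in a few lines using Python's standard `graphlib` module (Python 3.9+); it is an illustration, not Azkaban's implementation, and `run_workflow` and its parameters are names assumed for the example.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(
    deps: Dict[str, Set[str]],
    jobs: Dict[str, Callable[[], None]],
    max_attempts: int = 2,
) -> List[str]:
    """Run each job after its dependencies, retrying a failed job up to max_attempts times.

    deps maps a job name to the set of jobs that must finish before it starts.
    """
    completed: List[str] = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                jobs[name]()
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # out of retries: stop the workflow
        completed.append(name)
    return completed
```

For the chain Job A → Job B → Job C → Push to Production, `deps` would be `{"Job B": {"Job A"}, "Job C": {"Job B"}, "Push to Production": {"Job C"}}`.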

Model of computation
• Alternating Direction Method of Multipliers (ADMM)
• Distributed Conjugate Gradient Descent (DCGD)
• Distributed L-BFGS
• Bayesian Distributed Learning (BDL)
Graphs
Distributed learning
Near-line processing

LinkedIn Data Infrastructure: A few take-aways
1. Building infrastructure in a hyper-growth environment is challenging.
2. Few vs. many: balance over-specialized (agile) vs. generic (leverage-able) platform efforts (*)
3. Balance open-source products with home-grown platforms (**)
4. Data model and integration e2e are key (*)

Learning more
data.linkedin.com
