2. About Me
Current Life…
• LinkedIn
  - Web / Software Engineering
  - Search, Network, and Analytics (SNA)
  - Distributed Data Systems (DDS)
  - Me
In a Previous Life…
• Netflix, Cloud Database Architect
• eBay, Web Development, Research Lab, & Search Engine
And Many Years Prior…
• Studying Distributed Systems at Cornell University
@r39132 2
3. Our mission
Connect the world's professionals to make them more productive and successful
4. The worldβs largest professional network
• 150M+ members (as of February 9, 2012); over 60% of members are now international
• 75% of Fortune 100 companies use LinkedIn to hire (as of September 30, 2011)
• >2M company pages (as of December 31, 2011)
• Available in 16 languages
• ~4.2B professional searches in 2011
[Chart: LinkedIn members in millions, 2004-2010: 2, 4, 8, 17, 32, 55, 90]
5. Other Company Facts
• Headquartered in Mountain View, Calif., with offices around the world
• As of December 31, 2011, LinkedIn has 2,116 full-time employees located around the world
• Currently around 650 people work in Engineering
  - 400 in Web/Software Engineering
    - Plan to add another 200 in 2012 (as of December 31, 2011)
  - 250 in Operations
6. Agenda
✓ Company Overview
• Architecture
  - Data Infrastructure Overview
  - Technology Spotlight
    - Oracle
    - Voldemort
    - DataBus
    - Kafka
• Q & A
7. LinkedIn : Architecture
Overview
• Our site runs primarily on Java, with some use of Scala for specific infrastructure
• What runs on Scala?
  - Network Graph Service
  - Kafka
• Most of our services run on Apache + Jetty
8. LinkedIn : Architecture
• A web page requests information A and B
• Presentation Tier: a thin layer focused on building the UI; it assembles the page by making parallel requests to BST services
• Business Service Tier: encapsulates business logic; can call other BST clusters and its own DST cluster
• Data Service Tier: encapsulates DAL logic; each cluster is concerned with one Oracle schema
• Data Infrastructure (Oracle master/slave pairs, Memcached, Voldemort): concerned with the persistent storage of and easy access to data
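The presentation tier's fan-out can be sketched as below. This is a minimal illustration, not LinkedIn's code: the two service functions are hypothetical stand-ins for BST calls that return data "A" and "B".

```python
# Sketch of the presentation tier fanning out parallel requests to
# business-service endpoints and assembling the page when both return.
from concurrent.futures import ThreadPoolExecutor

def fetch_profile(member_id):      # stand-in for the BST call returning "A"
    return {"member": member_id, "headline": "Engineer"}

def fetch_connections(member_id):  # stand-in for the BST call returning "B"
    return {"member": member_id, "connections": 500}

def render_page(member_id):
    # Issue both requests in parallel, then block until each completes.
    with ThreadPoolExecutor(max_workers=2) as pool:
        a = pool.submit(fetch_profile, member_id)
        b = pool.submit(fetch_connections, member_id)
        return {"profile": a.result(), "network": b.result()}

page = render_page(42)
```

Issuing the calls in parallel keeps page latency close to the slowest single service call rather than the sum of all of them.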
11. LinkedIn : Data Infrastructure Technologies
This talk will focus on a few of the key technologies below
• Database Technologies
  - Oracle
  - Voldemort
  - Espresso - a new key-value store under development
• Data Replication Technologies
  - Kafka
  - DataBus
• Search Technologies
  - Zoie - real-time search and indexing with Lucene
  - Bobo - faceted search library for Lucene
  - SenseiDB - a fast, real-time, faceted, key-value and full-text search engine, and more
13. Oracle : Overview
Oracle
• All user-provided data is stored in Oracle - our current source of truth
• About 50 schemas running on tens of physical instances
• With our user base and traffic growing at an accelerating pace, how do we scale Oracle for user-provided data?

Scaling Reads
• Oracle slaves (cf. DSC)
• Memcached
• Voldemort - for key-value lookups

Scaling Writes
• Move to more expensive hardware or replace Oracle with something else
14. Oracle : Overview - Data Service Context
Scaling Oracle Reads using DSC
• DSC uses a token (e.g. a cookie) to ensure that a reader always sees his or her own writes immediately
  - If I update my own status, it is okay if you don't see the change for a few minutes, but I have to see it immediately
15. Oracle : Overview - How DSC Works
• When a user writes data to the master, the DSC token (for that data domain) is updated with a timestamp
• When the user reads data, we first attempt to read from a replica (a.k.a. slave) database
• If the data in the slave is older than the data in the DSC token, we read from the master instead
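The timestamp comparison above can be sketched in a few lines. This is an illustrative toy, not LinkedIn's implementation: the `Store` class and function names are invented, and the "token" is just a dict standing in for the cookie.

```python
import time

class Store:
    """A toy database holding one value plus the time it was last written."""
    def __init__(self):
        self.value = None
        self.updated_at = 0.0

    def write(self, value, ts):
        self.value = value
        self.updated_at = ts

def write_with_dsc(master, token, value):
    """On write, update the master and stamp the user's DSC token."""
    ts = time.time()
    master.write(value, ts)
    token["last_write_ts"] = ts  # carried in a cookie, per the slides

def read_with_dsc(master, replica, token):
    """Read from the replica unless it is older than the user's last write."""
    if replica.updated_at >= token.get("last_write_ts", 0.0):
        return replica.value   # replica has caught up: serve the cheap read
    return master.value        # replica is stale for this user: go to master

# The writer sees his own write immediately, even before replication runs.
master, replica, token = Store(), Store(), {}
write_with_dsc(master, token, "new status")
assert read_with_dsc(master, replica, token) == "new status"  # served by master
replica.write("new status", time.time())                      # replication catches up
assert read_with_dsc(master, replica, token) == "new status"  # now served by replica
```

Other users carry no fresh token for this data domain, so their reads keep hitting the replicas, which is exactly the read-scaling the slide describes.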
17. Voldemort : Overview
• A distributed, persistent key-value store influenced by the AWS Dynamo paper
• Key features of Dynamo:
  - Highly scalable, available, and performant
  - Achieves this via tunable consistency
    - For higher consistency, the user accepts lower availability, scalability, and performance, and vice versa
  - Provides several self-healing mechanisms for when data does become inconsistent:
    - Read Repair: repairs the value for a key when the key is looked up/read
    - Hinted Handoff: buffers the value for a key that wasn't successfully written, then writes it later
    - Anti-Entropy Repair: scans the entire data set on a node and fixes it
  - Provides a means to detect node failure and a means to recover from it:
    - Failure detection
    - Bootstrapping new nodes
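"Tunable consistency" in the Dynamo lineage is usually expressed with quorum parameters, which the slide does not spell out; the sketch below shows the standard formulation under that assumption. With N replicas, a write must reach W of them and a read must consult R; if R + W > N, every read quorum overlaps every write quorum.

```python
# Dynamo-style tunable consistency: the R + W > N overlap rule.
# Lowering R or W trades consistency for availability and latency.

def quorums_overlap(n, r, w):
    """True iff any read quorum must intersect any write quorum."""
    return r + w > n

def read_latest(replicas, r):
    """Consult r replicas and keep the value with the highest version."""
    consulted = replicas[:r]
    return max(consulted, key=lambda rep: rep["version"])["value"]

# N = 3 replicas; a write with W = 2 reached the first two replicas only.
replicas = [
    {"version": 2, "value": "new"},
    {"version": 2, "value": "new"},
    {"version": 1, "value": "old"},   # stale replica that missed the write
]

assert quorums_overlap(3, r=2, w=2)         # R + W = 4 > N = 3
assert read_latest(replicas, r=2) == "new"  # the overlap guarantees freshness
assert not quorums_overlap(3, r=1, w=2)     # R = 1 may hit only the stale replica
```

When quorums do not overlap, stale reads are possible, which is exactly why the self-healing mechanisms above (read repair, hinted handoff, anti-entropy) exist.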
18. Voldemort : Overview
API
• VectorClock<V> get (K key)
• put (K key, VectorClock<V> value)
• applyUpdate(UpdateAction action, int retries)

Layered, Pluggable Architecture
[Diagram: client and server are each built from stacked layers - Client API, Conflict Resolution, Serialization, Repair Mechanism, Failure Detector, Routing - with the server also hosting the Storage Engine and an Admin service]

Voldemort-specific features
• Implements a layered, pluggable architecture
• Each layer implements a common interface (cf. the API above); this allows us to replace or remove implementations at any layer
  - Pluggable data storage layer: BDB JE, custom RO storage, etc.
  - Pluggable routing: supports single- or multi-datacenter routing
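The layering idea can be sketched as stacked wrappers that all expose the same get/put interface. The class names below are illustrative, not Voldemort's actual classes; the point is only that any layer can be swapped or removed because each one wraps an inner layer behind a common interface.

```python
class StorageEngine:
    """Bottom layer: an in-memory stand-in for BDB JE or a custom RO store."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value

class SerializationLayer:
    """Middle layer: encodes values before handing them to the layer below."""
    def __init__(self, inner):
        self.inner = inner
    def get(self, key):
        raw = self.inner.get(key)
        return None if raw is None else raw.decode("utf-8")
    def put(self, key, value):
        self.inner.put(key, value.encode("utf-8"))

class RoutingLayer:
    """Top layer: picks a partition; here, trivially by hashing the key."""
    def __init__(self, partitions):
        self.partitions = partitions
    def _pick(self, key):
        return self.partitions[hash(key) % len(self.partitions)]
    def get(self, key):
        return self._pick(key).get(key)
    def put(self, key, value):
        self._pick(key).put(key, value)

# Assemble the stack: routing over serialization over storage, per partition.
store = RoutingLayer([SerializationLayer(StorageEngine()) for _ in range(2)])
store.put("member:42", "Sid Anand")
assert store.get("member:42") == "Sid Anand"
assert store.get("missing") is None
```

Because every layer honors the same interface, moving a layer such as Routing from client to server (next slide) is a re-stacking exercise rather than a rewrite.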
19. Voldemort : Overview
Layered, Pluggable Architecture

Voldemort-specific features
• Supports a fat-client or fat-server model
  - The Repair Mechanism, Failure Detector, and Routing layers can run on either the server or the client
• LinkedIn currently runs the fat client, but we would like to move to a fat-server model
21. Voldemort : Usage Patterns @ LinkedIn
2 Usage Patterns
• Read-write store
  - Uses BDB JE for the storage engine
  - 50% of Voldemort stores (a.k.a. tables) are RW
• Read-only store
  - Uses a custom read-only format
  - 50% of Voldemort stores (a.k.a. tables) are RO
• Let's look at the RO store
22. Voldemort : RO Store Usage at LinkedIn
• People You May Know
• Viewers of this profile also viewed
• Related Searches
• Events you may be interested in
• LinkedIn Skills
• Jobs you may be interested in
23. Voldemort : Usage Patterns @ LinkedIn
RO Store Usage Pattern
1. Use Hadoop to build a model
2. Voldemort loads the output of Hadoop
3. Voldemort serves fast key-value look-ups on the site
  - e.g. for key = "Sid Anand", get all the people that "Sid Anand" may know
  - e.g. for key = "Sid Anand", get all the jobs that "Sid Anand" may be interested in
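The three steps above can be sketched end to end, with the Hadoop job replaced by a plain Python function and the RO store by a dict. Everything here is an illustrative toy; the friends-of-friends rule is a deliberately simplified stand-in for the real "People You May Know" model.

```python
def build_model(connections):
    """Offline step (Hadoop in the real pipeline): for each member, suggest
    friends-of-friends they are not already connected to."""
    model = {}
    for member, friends in connections.items():
        suggestions = set()
        for f in friends:
            suggestions |= set(connections.get(f, [])) - {member} - set(friends)
        model[member] = sorted(suggestions)
    return model

# Step 2: bulk-load the offline output into the read-only store (a dict here).
connections = {
    "Sid": ["Ann"],
    "Ann": ["Sid", "Bob"],
    "Bob": ["Ann"],
}
ro_store = build_model(connections)

# Step 3: serve fast key-value look-ups on the site.
assert ro_store["Sid"] == ["Bob"]   # Bob is Ann's friend, so "you may know Bob"
assert ro_store["Bob"] == ["Sid"]
```

The split matters operationally: all the expensive computation happens offline, so the serving path is a pure look-up with predictable latency.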
24. How Do The Voldemort RO Stores Perform?
28. DataBus : Use-Cases @ LinkedIn
[Diagram: Oracle emits data change events that DataBus delivers to the Standardization, Search Index, and Graph Index services and to read replicas]

A user updates his profile with skills and position history. He also accepts a connection.
• The write is made to an Oracle master, and DataBus replicates:
  - the profile change to the Standardization service
    - e.g. the many forms of "IBM" are canonicalized for search-friendliness and recommendation-friendliness
  - the profile change to the Search Index service
    - Recruiters can find you immediately by new keywords
  - the connection change to the Graph Index service
    - The user can now start receiving feed updates from his new connections immediately
29. DataBus : Architecture
DataBus consists of 2 services
• Relay Service
  - Sharded
  - Maintains an in-memory buffer (event window) per shard
  - Each shard polls Oracle for changes and then deserializes transactions into Avro
• Bootstrap Service
  - Picks up online changes as they appear in the Relay
  - Supports 2 types of operations from clients:
    - If a client falls behind and needs records older than what the Relay has, Bootstrap can send consolidated deltas
    - If a new client comes online, Bootstrap can send a consistent snapshot
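The Relay-versus-Bootstrap decision can be sketched as a small routing function. The names, the window size, and the offset arithmetic below are all hypothetical; the slides describe the behavior but not these specifics.

```python
# The Relay holds only a bounded in-memory event window, so a consumer's
# position in the change stream determines which service must serve it.

RELAY_WINDOW = 1000  # assume the Relay buffers the 1000 most recent events

def pick_source(consumer_offset, latest_offset, is_new_consumer=False):
    """Return which DataBus service a consumer should read from."""
    if is_new_consumer:
        return "bootstrap-snapshot"    # consistent snapshot, then catch up
    oldest_in_relay = latest_offset - RELAY_WINDOW
    if consumer_offset >= oldest_in_relay:
        return "relay"                 # still inside the in-memory window
    return "bootstrap-deltas"          # fell behind: consolidated deltas

assert pick_source(99_900, 100_000) == "relay"
assert pick_source(50_000, 100_000) == "bootstrap-deltas"
assert pick_source(0, 100_000, is_new_consumer=True) == "bootstrap-snapshot"
```

Consolidated deltas are the key saving: a consumer that missed 50,000 events over one hot row receives the final state of that row once, not 50,000 individual updates.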
30. DataBus : Architecture
[Diagram: the Relay captures DB changes into an event window and streams online changes to consumers via the DataBus client library; the Bootstrap service serves consolidated deltas since time T, or a consistent snapshot at time U, to clients that cannot be served from the Relay]

Guarantees
• Transactional semantics
• In-commit-order delivery
• At-least-once delivery
• Durability (by data source)
• High availability and reliability
• Low latency
31. DataBus : Architecture - Bootstrap
• Generate consistent snapshots and consolidated deltas during continuous updates with long-running queries
[Diagram: inside the Bootstrap server, a Log Writer appends recent events from the Relay to Log Storage, and a Log Applier replays those events into Snapshot Storage; clients read recent events from the log and older data from the snapshot]
33. Kafka : Usage at LinkedIn
Whereas DataBus is used for database change capture and replication, Kafka is used for application-level data streams.

Examples:
• End-user action tracking (a.k.a. web tracking) of
  - Emails opened
  - Pages seen
  - Links followed
  - Searches executed
• Operational metrics
  - Network & system metrics, such as
    - TCP metrics (connection resets, message resends, etc.)
    - System metrics (iops, CPU, load average, etc.)
34. Kafka : Overview
[Diagram: the web tier pushes events to the Kafka broker tier (sequential writes, ~100 MB/sec in); consumers pull batches via sendfile (~200 MB/sec out) through a client-library iterator per topic; Zookeeper manages topic-to-offset state and topic/partition ownership]

Features
• Pub/sub
• Batch send/receive
• System decoupling

Guarantees
• At-least-once delivery
• Very high throughput
• Low latency
• Durability
• Horizontally scalable

Scale
• Billions of events
• TBs per day
• Inter-colo latency: a few seconds
• Typical retention: weeks
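The pull model in the diagram can be sketched as an append-only log per topic, with each consumer tracking its own offset. This is an illustrative toy, not Kafka's actual implementation; the `Broker` class and its methods are invented for the sketch.

```python
class Broker:
    """Keeps an append-only message log per topic. The broker holds no
    per-consumer state: consumers remember their own offsets."""
    def __init__(self):
        self.topics = {}                       # topic -> list of messages

    def publish(self, topic, message):
        self.topics.setdefault(topic, []).append(message)

    def fetch(self, topic, offset, max_messages=100):
        """Consumers pull a batch starting at the offset they remember."""
        log = self.topics.get(topic, [])
        return log[offset:offset + max_messages]

broker = Broker()
for event in ["page_view", "email_open", "search"]:
    broker.publish("tracking", event)

offset = 0                                     # consumer-side state only
batch = broker.fetch("tracking", offset, max_messages=2)
offset += len(batch)                           # advance past consumed messages

assert batch == ["page_view", "email_open"]
assert broker.fetch("tracking", offset) == ["search"]
```

Keeping the offset on the consumer side is what makes the broker cheap: appends are sequential writes, fetches are range reads, and there is no per-message delivery bookkeeping.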
35. Kafka : Overview
Key Design Choices
• When reading from a file and sending to a network socket, we typically incur 4 buffer copies and 2 OS system calls
  - Kafka leverages the sendfile API to eliminate 2 of the buffer copies and 1 of the system calls
• No double-buffering of messages - we rely on the OS page cache and do not store a copy of the message in the JVM
  - Less pressure on memory and GC
  - If the Kafka process is restarted on a machine, recently accessed messages are still in the page cache, so we get the benefit of a warm start
• Kafka doesn't keep track of which messages have yet to be consumed - i.e. no bookkeeping overhead
  - Instead, messages have a time-based SLA expiration - after 7 days, messages are deleted
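The time-based expiration in the last bullet can be sketched as a simple filter over timestamped messages. The function and data layout are illustrative; real Kafka deletes whole log segment files rather than individual messages.

```python
# Time-based SLA expiration: instead of tracking which messages were
# consumed, drop anything older than the retention period (7 days above).
# Timestamps are plain epoch seconds.

SEVEN_DAYS = 7 * 24 * 3600

def expire(log, now, retention=SEVEN_DAYS):
    """Keep only messages whose timestamp is within the retention window."""
    return [(ts, msg) for ts, msg in log if now - ts <= retention]

now = 1_000_000_000
log = [
    (now - 8 * 24 * 3600, "too old"),    # 8 days ago: expired
    (now - 3 * 24 * 3600, "recent"),     # 3 days ago: kept
    (now, "fresh"),
]

assert expire(log, now) == [(now - 3 * 24 * 3600, "recent"), (now, "fresh")]
```

Expiry by age is a single comparison per message (or per segment), which is what lets the broker avoid any per-consumer delivery state.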
40. Acknowledgments
Presentation & Content
• Chavdar Botev (DataBus) @cbotev
• Roshan Sumbaly (Voldemort) @rsumbaly
• Neha Narkhede (Kafka) @nehanarkhede
Development Team
Aditya Auradkar, Chavdar Botev, Shirshanka Das, Dave DeMaagd, Alex
Feinberg, Phanindra Ganti, Lei Gao, Bhaskar Ghosh, Kishore
Gopalakrishna, Brendan Harris, Joel Koshy, Kevin Krawez, Jay Kreps, Shi
Lu, Sunil Nagaraj, Neha Narkhede, Sasha Pachev, Igor Perisic, Lin Qiao,
Tom Quiggle, Jun Rao, Bob Schulman, Abraham Sebastian, Oliver Seeliger,
Adam Silberstein, Boris Skolnick, Chinmay Soman, Roshan Sumbaly, Kapil
Surlaker, Sajid Topiwala, Balaji Varadarajan, Jemiah Westerman, Zach
White, David Zhang, and Jason Zhang