Databus




1/29/2013   Recruiting Solutions
            `                        Databus   1
INTRODUCTION


LinkedIn by Numbers
• World’s largest professional network
• 187M+ members worldwide as of Q3 2012
  – Growing at the rate of two per second
• 85 of Fortune 100 companies use Talent Solutions to hire
• > 2.6M company pages
• > 4B search queries
• 75K+ developers leveraging our APIs
• 1.3M unique publishers


The Consequence of Specialization in Data Systems
• Data flow is essential
• Data consistency is critical!
Solution: Databus



[Diagram: updates are written to the Primary DB; Databus carries its data change events to Standardization, Search Index, Graph Index, and Read Replicas.]
Two Ways


• Application code dual-writes to database and pub-sub system
  – Easy on the surface
  – Consistent?
• Extract changes from database commit log
  – Tough but possible
  – Consistent!!!
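
The difference between the two approaches can be sketched in a few lines of Java. This is an illustration only: the lists standing in for the database, its commit log, and the pub-sub system, and the names `TwoWays`, `dualWrite`, and `extractFromLog`, are hypothetical and not part of Databus. The only assumption is that a failure can occur between the two writes of a dual write.

```java
import java.util.ArrayList;
import java.util.List;

final class TwoWays {
    /** Dual write: the application commits to the database, then publishes separately. */
    static void dualWrite(List<String> db, List<String> bus, String change, boolean publishFails) {
        db.add(change);            // step 1: commit to the database
        if (publishFails) {
            return;                // crash or network error between the two writes
        }
        bus.add(change);           // step 2: publish to the pub-sub system
    }

    /**
     * Log extraction: publish every committed change that has not been published yet.
     * Assumes the bus receives changes only from this method, so its size marks the
     * position reached in the commit log.
     */
    static void extractFromLog(List<String> commitLog, List<String> bus) {
        for (int i = bus.size(); i < commitLog.size(); i++) {
            bus.add(commitLog.get(i));
        }
    }
}
```

With the dual write, a single failed publish leaves the pub-sub system permanently missing a change the database has committed. With log extraction, the commit log is the single source of truth, so a retry simply resumes where the last run stopped.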



Key Design Decisions : Semantics
• Logical clocks attached to the source
  – Physical offsets could be used for internal
    transport
  – Simplifies data portability
• Pull model
  – Restarts are simple
  – Derived State = f (Source state, Clock)
  – + Idempotence = Timeline Consistent!
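
A minimal Java sketch of why pull plus idempotence gives timeline consistency under at-least-once delivery. The classes `ChangeEvent` and `DerivedState` are hypothetical, not Databus types; the sketch assumes each event carries the source's logical clock value (SCN) and that the consumer remembers, per key, the clock of the last change it applied.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical change event: the key that changed, its new value, and the source clock. */
final class ChangeEvent {
    final long scn;
    final String key;
    final String value;

    ChangeEvent(long scn, String key, String value) {
        this.scn = scn;
        this.key = key;
        this.value = value;
    }
}

/** Derived state that is a function of the source state and the clock only. */
final class DerivedState {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> appliedScn = new HashMap<>();

    /** Applies an event unless an equal or newer change to the same key was already applied. */
    void apply(ChangeEvent e) {
        Long seen = appliedScn.get(e.key);
        if (seen != null && e.scn <= seen) {
            return; // redelivered or stale event: applying it again must change nothing
        }
        values.put(e.key, e.value);
        appliedScn.put(e.key, e.scn);
    }

    /** Returns a copy of the current derived state. */
    Map<String, String> snapshot() {
        return new HashMap<>(values);
    }
}
```

After a restart the consumer can re-pull from an earlier clock value and replay events it has already seen; because `apply` is idempotent, the replay leaves the derived state unchanged.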

Key Design Decisions : Systems
• Isolate fast consumers from slow consumers
  – Workload separation between online, catch-up,
    bootstrap
• Isolate sources from consumers
  – Schema changes
  – Physical layout changes
  – Speed mismatch
• Schema-aware
  – Filtering, Projections
  – Typically network-bound, so can burn more CPU

Requirements
•   Timeline consistency
•   Guaranteed, at least once delivery
•   Low latency
•   Schema evolution
•   Source independence
•   Scalable consumers
•   Handle slow/new consumers without
    affecting happy ones (look-back requirements)

ARCHITECTURE


Initial Design (2007)
[Diagram: source clock (SCN) timeline marked 0, 70000, 100000, 102400; DB; Relay with in-memory buffer (3 hrs); direct pull and proxied pull paths; two happy consumers and one slow consumer.]

Pros:
1. Consumer scaling
2. Some isolation

Cons:
• Slow consumers overwhelming the DB
Software Architecture
Four Logical Components
• Fetcher
  – Fetch from db, relay…
• Log Store
  – Store log snippet
• Snapshot Store
  – Store moving data snapshot
• Subscription Client
  – Orchestrate pull across these
The Databus System
[Diagram: source clock (SCN) timeline marked 0, 30000, 70000, 80000, 90000, 100000, 102400; DB; Relay with in-memory buffer (3 hrs); Bootstrap Service made up of a Server, Log Storage (10 days), and Snapshot Store (infinite); two happy consumers and one slow consumer.]

The Relay
•   Change event buffering (~ 2 – 7 days)
•   Low latency (10-15 ms)
•   Filtering, Projection
•   Hundreds of consumers per relay
•   Scale-out, High-availability through
    redundancy
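
A minimal Java sketch of the relay's buffering role, under stated assumptions. `RelayEvent` and `RelayBuffer` are hypothetical names, not Databus classes. The sketch assumes every event has a strictly increasing SCN (in a real system several events of one transaction may share a clock value) and bounds the buffer by event count rather than by time or memory.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Optional;

/** Hypothetical buffered change event. */
final class RelayEvent {
    final long scn;
    final String payload;

    RelayEvent(long scn, String payload) {
        this.scn = scn;
        this.payload = payload;
    }
}

/** Bounded in-memory buffer that consumers pull from by clock value. */
final class RelayBuffer {
    private final int capacity;
    private final Deque<RelayEvent> events = new ArrayDeque<>();
    private long evictedUpTo = 0; // highest SCN already dropped from the buffer

    RelayBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Appends an event, evicting the oldest one when the buffer is full. */
    void append(RelayEvent e) {
        if (events.size() == capacity) {
            evictedUpTo = events.removeFirst().scn;
        }
        events.addLast(e);
    }

    /**
     * Returns up to maxEvents events with SCN greater than sinceScn.
     * An empty Optional means the consumer has fallen behind the buffer
     * and has to catch up elsewhere (the bootstrap service in Databus terms).
     */
    Optional<List<RelayEvent>> pull(long sinceScn, int maxEvents) {
        if (sinceScn < evictedUpTo) {
            return Optional.empty();
        }
        List<RelayEvent> out = new ArrayList<>();
        for (RelayEvent e : events) {
            if (out.size() == maxEvents) {
                break;
            }
            if (e.scn > sinceScn) {
                out.add(e);
            }
        }
        return Optional.of(out);
    }
}
```

Because consumers pull by clock value, the relay keeps no per-consumer state; a slow consumer costs the relay nothing beyond the one check that tells it to go elsewhere.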



Deployment Options




• Option 1: Peered Deployment
• Option 2: Clustered Deployment
The Bootstrap Service
•   Catch-all for slow / new consumers
•   Isolate source OLTP instance from large scans
•   Log Store + Snapshot Store
•   Optimizations
    – Periodic merge
    – Predicate push-down
    – Catch-up versus full bootstrap
• Guaranteed progress for consumers via chunking
• Implementations
    – Database (MySQL)
    – Raw Files
• Bridges the continuum between stream and batch systems
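
How a log store and a snapshot store can work together is sketched below in Java. This is an assumption-laden illustration, not the Databus implementation: `Row`, `BootstrapStore`, and all method names are hypothetical, chunking and predicate push-down are left out, and the log is trimmed only up to what the snapshot already covers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical row version: key, value, and the source clock of the change. */
final class Row {
    final String key;
    final String value;
    final long scn;

    Row(String key, String value, long scn) {
        this.key = key;
        this.value = value;
        this.scn = scn;
    }
}

final class BootstrapStore {
    private final List<Row> log = new ArrayList<>();           // recent changes in SCN order
    private final Map<String, Row> snapshot = new TreeMap<>(); // newest row per key
    private long snapshotScn = 0; // snapshot reflects every change up to this SCN
    private long logStartScn = 0; // changes at or below this SCN were trimmed from the log

    void appendToLog(Row r) {
        log.add(r);
    }

    /** Periodic merge: fold log changes into the snapshot, keeping the newest row per key. */
    void mergeLogIntoSnapshot() {
        for (Row r : log) {
            Row current = snapshot.get(r.key);
            if (current == null || current.scn < r.scn) {
                snapshot.put(r.key, r);
            }
            snapshotScn = Math.max(snapshotScn, r.scn);
        }
    }

    /** Trims log entries up to scn, but never beyond what the snapshot already covers. */
    void trimLog(long scn) {
        final long limit = Math.min(scn, snapshotScn);
        log.removeIf(r -> r.scn <= limit);
        logStartScn = Math.max(logStartScn, limit);
    }

    /**
     * Catch-up when the log still covers sinceScn; otherwise a full bootstrap:
     * all snapshot rows first, then the log changes newer than the snapshot.
     */
    List<Row> read(long sinceScn) {
        List<Row> out = new ArrayList<>();
        long replayFrom = sinceScn;
        if (sinceScn < logStartScn) {
            out.addAll(snapshot.values());
            replayFrom = snapshotScn;
        }
        for (Row r : log) {
            if (r.scn > replayFrom) {
                out.add(r);
            }
        }
        return out;
    }
}
```

A consumer that is only slightly behind replays the log; a new or very old consumer reads one row per key from the snapshot instead of every historical change, and neither path scans the source database.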

The Consumer Client Library
• Glue between Databus infra and business logic
  in the consumer
• Isolates the consumer from changes in the
  Databus layer
• Switches between relay and bootstrap as
  needed
• API
  – Callback with transactions
  – Iterators over windows

Fetcher Implementations
• Oracle
  – Trigger-based
• MySQL
  – Custom-storage-engine based
• In Labs
  – Alternative implementations for Oracle
  – OpenReplicator integration for MySQL


Meta-data Management
• Event definition, serialization and transport
  – Avro
• Oracle, MySQL
  – Avro definition generated from the table schema
• Schema evolution
  – Only backwards-compatible changes allowed
• Isolation between upgrades on producer and
  consumer

Scaling the Consumers (Partitioning)
• Server-side filtering
  – Range, mod, hash
  – Allows client to control partitioning function
• Consumer groups
  – Distribute partitions evenly across a group
  – Move partitions to available consumers on failure
  – Minimize re-processing
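
The two ideas above, a mod-based partition function and partition hand-over inside a consumer group, can be sketched in Java as follows. All names (`Partitioning`, `modPartition`, `assign`, `reassignOnFailure`) are hypothetical, and the policy of handing orphaned partitions to the least-loaded survivor is an assumption chosen for illustration, not the documented Databus behaviour.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class Partitioning {
    /** Mod filter: maps a numeric key to a partition in [0, numPartitions). */
    static int modPartition(long key, int numPartitions) {
        return (int) Math.floorMod(key, (long) numPartitions);
    }

    /** Spreads partitions 0..numPartitions-1 round-robin over the consumers of a group. */
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (String c : consumers) {
            out.put(c, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            out.get(consumers.get(p % consumers.size())).add(p);
        }
        return out;
    }

    /**
     * Moves only the failed consumer's partitions, each to the least-loaded survivor.
     * Survivors keep what they already own, so they do not re-process anything.
     */
    static Map<String, List<Integer>> reassignOnFailure(Map<String, List<Integer>> current,
                                                        String failed) {
        Map<String, List<Integer>> next = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : current.entrySet()) {
            if (!e.getKey().equals(failed)) {
                next.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        List<Integer> orphans = current.getOrDefault(failed, new ArrayList<>());
        if (next.isEmpty() && !orphans.isEmpty()) {
            throw new IllegalStateException("no surviving consumer to take over");
        }
        for (int p : orphans) {
            String target = null;
            for (String c : next.keySet()) {
                if (target == null || next.get(c).size() < next.get(target).size()) {
                    target = c;
                }
            }
            next.get(target).add(p);
        }
        return next;
    }
}
```

`Math.floorMod` is used instead of `%` so that negative keys still land in a valid partition.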



A NEW CONSUMER


Development with Databus – Client Library
[Diagram: consumers implement the Stream Event Callback API and the Bootstrap Event Callback API of the Databus Client Library; the Databus client uses the library's Client API.]
• Callback APIs
  – onDataEvent(DbusEvent, Decoder)
  – …
• Client API
  – register(consumers, sources, filter)
  – start()
  – shutdown()

Databus Consumer Implementation
class MyConsumer
      extends AbstractDatabusStreamConsumer
{
  @Override
  public ConsumerCallbackResult onDataEvent(DbusEvent e,
                                            DbusEventDecoder d) {
    // use the map-like Avro GenericRecord
    GenericRecord g = d.getGenericRecord(e, null);
    // or use the auto-generated Java class
    // (named typed so it does not clash with the event parameter e)
    MyEvent typed = d.getTypedValue(e, null, MyEvent.class);
    …
    return ConsumerCallbackResult.SUCCESS;
  }
}

Starting the client
public static void main(String[] args) throws Exception {
  // configure
  DatabusHttpClientImpl.Config clientConfig =
                          new DatabusHttpClientImpl.Config();
  clientConfig.loadFromFile("mydbus", "mdbus.props");
  DatabusHttpClientImpl client =
               new DatabusHttpClientImpl(clientConfig);
  // register callback
  MyConsumer callback = new MyConsumer();
  client.registerDatabusStreamListener(callback,
          null, "com.linkedin.events.member2.MemberProfile");
  // start client library
  client.startAndBlock();
}

Event Callback APIs
PERFORMANCE


Relay Throughput

Consumer Throughput

End-to-End Latency

Snapshot vs Catchup

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

Introduction to Databus

  • 1. Databus. 1/29/2013, Recruiting Solutions
  • 3. LinkedIn by Numbers
      • World’s largest professional network
      • 187M+ members worldwide as of Q3 2012
          – Growing at the rate of two per second
      • 85 of the Fortune 100 companies use Talent Solutions to hire
      • > 2.6M company pages
      • > 4B search queries
      • 75K+ developers leveraging our APIs
      • 1.3M unique publishers
  • 4. The Consequence of Specialization in Data Systems
      • Data flow is essential
      • Data consistency is critical!!!
  • 5. Solution: Databus
      [Diagram labels: Updates, Primary DB, Data Change Events, Databus, Standardization, Search Index, Graph Index, Read Replicas]
  • 6. Two Ways
      • Application code dual writes to database and pub-sub system
          – Easy on the surface
          – Consistent?
      • Extract changes from database commit log
          – Tough but possible
          – Consistent!!!
  • 7. Key Design Decisions: Semantics
      • Logical clocks attached to the source
          – Physical offsets could be used for internal transport
          – Simplifies data portability
      • Pull model
          – Restarts are simple
          – Derived State = f(Source state, Clock)
          – + Idempotence = Timeline Consistent!
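The "Derived State = f(Source state, Clock)" and idempotence points on slide 7 can be illustrated with a small consumer-side sketch. Everything here is hypothetical and is not the Databus API: the class names, the key/value event shape, and the assumption that each event carries a unique, increasing SCN are for illustration only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical consumer-side sketch; none of these names come from Databus. */
final class TimelineConsumerSketch {

    /** One change event, stamped with the source's logical clock (SCN). */
    static final class ChangeEvent {
        final long scn;
        final String key;
        final String value;

        ChangeEvent(long scn, String key, String value) {
            this.scn = scn;
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, String> derivedState = new HashMap<>();
    private long checkpointScn = -1L; // highest SCN applied so far; -1 means none

    /** The SCN a restarted consumer would send with its next pull request. */
    long checkpoint() {
        return checkpointScn;
    }

    String get(String key) {
        return derivedState.get(key);
    }

    /** Applies a pulled batch in order; redelivered events are skipped. */
    void apply(List<ChangeEvent> batch) {
        for (ChangeEvent e : batch) {
            if (e.scn <= checkpointScn) {
                continue; // already applied, so at-least-once redelivery is harmless
            }
            derivedState.put(e.key, e.value); // upsert: same input, same result
            checkpointScn = e.scn;
        }
    }
}
```

Because the consumer only remembers a logical clock value, a restart amounts to pulling again from `checkpoint()`; applying the same batch twice leaves the derived state unchanged.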
  • 8. Key Design Decisions: Systems
      • Isolate fast consumers from slow consumers
          – Workload separation between online, catch-up, bootstrap
      • Isolate sources from consumers
          – Schema changes
          – Physical layout changes
          – Speed mismatch
      • Schema-aware
          – Filtering, Projections
          – Typically network-bound → can burn more CPU
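As a rough illustration of the "Schema-aware" point on slide 8, the sketch below filters events and projects them down to a subset of fields before they would be sent over the network. It is only a sketch under stated assumptions: events are modeled as plain maps rather than Avro records, and `RelayFilterSketch` and `filterAndProject` are hypothetical names, not part of Databus.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of server-side filtering and projection. */
final class RelayFilterSketch {

    /**
     * Keeps only events matching the filter, and within each kept event
     * only the requested fields (in the requested order).
     */
    static List<Map<String, Object>> filterAndProject(
            List<Map<String, Object>> events,
            Predicate<Map<String, Object>> filter,
            List<String> fields) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> event : events) {
            if (!filter.test(event)) {
                continue; // filtered out: never leaves the server
            }
            Map<String, Object> projected = new LinkedHashMap<>();
            for (String field : fields) {
                if (event.containsKey(field)) {
                    projected.put(field, event.get(field));
                }
            }
            out.add(projected);
        }
        return out;
    }
}
```

The trade-off matches the slide: the server spends extra CPU per event so that fewer bytes cross the network.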
  • 9. Requirements
      • Timeline consistency
      • Guaranteed, at-least-once delivery
      • Low latency
      • Schema evolution
      • Source independence
      • Scalable consumers
      • Handle slow/new consumers without affecting happy ones (look-back requirements)
  • 11. Initial Design (2007)
      [Diagram labels: source clock (SCN) timeline marked 0, 70000, 100000, 102400; DB; Relay with in-memory buffer (3 hrs); happy consumers; slow consumer; direct pull; proxied pull]
      • Pros:
          – 1. Consumer scaling
          – 2. Some isolation
      • Cons:
          – Slow consumers overwhelming the DB
  • 12. Software Architecture: Four Logical Components
      • Fetcher
          – Fetch from db, relay…
      • Log Store
          – Store log snippet
      • Snapshot Store
          – Store moving data snapshot
      • Subscription Client
          – Orchestrate pull across these
  • 13. The Databus System
      [Diagram labels: source clock (SCN) timeline marked 0, 30000, 70000, 80000, 90000, 100000, 102400; DB; Relay with in-memory buffer (3 hrs); Log (10 days); Snapshot (infinite); Bootstrap Service with Server, Log Storage and Snapshot Store; happy consumers; slow consumer]
  • 14. The Relay
      • Change event buffering (~ 2 – 7 days)
      • Low latency (10-15 ms)
      • Filtering, Projection
      • Hundreds of consumers per relay
      • Scale-out, high availability through redundancy
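The relay's change-event buffering can be pictured as a bounded window over the SCN timeline. The following is a minimal sketch, not the actual relay: it bounds the buffer by event count instead of time, stores only SCNs (payloads omitted), assumes SCNs are appended in increasing order, and uses the hypothetical names `RelayBufferSketch`, `append`, and `eventsAfter`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Optional;

/** Hypothetical bounded in-memory buffer keyed by SCN. */
final class RelayBufferSketch {
    private final Deque<Long> scns = new ArrayDeque<>();
    private final int capacity;
    private long evictedUpTo = -1L; // highest SCN evicted so far; -1 means none

    RelayBufferSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Appends the next event, evicting the oldest one when the buffer is full. */
    void append(long scn) {
        if (scns.size() == capacity) {
            evictedUpTo = scns.removeFirst();
        }
        scns.addLast(scn);
    }

    /**
     * Returns the buffered events with SCN greater than sinceScn, or empty
     * if some of the events the caller needs have already been evicted.
     */
    Optional<List<Long>> eventsAfter(long sinceScn) {
        if (sinceScn < evictedUpTo) {
            return Optional.empty(); // gap: the caller fell off the buffer
        }
        List<Long> out = new ArrayList<>();
        for (long scn : scns) {
            if (scn > sinceScn) {
                out.add(scn);
            }
        }
        return Optional.of(out);
    }
}
```

An empty result is the signal that a consumer is too far behind for the relay and has to catch up from somewhere with a longer look-back.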
  • 15. Deployment Options
      • Option 1: Peered Deployment
      • Option 2: Clustered Deployment
  • 16. The Bootstrap Service
      • Catch-all for slow / new consumers
      • Isolate source OLTP instance from large scans
      • Log Store + Snapshot Store
      • Optimizations
          – Periodic merge
          – Predicate push-down
          – Catch-up versus full bootstrap
      • Guaranteed progress for consumers via chunking
      • Implementations
          – Database (MySQL)
          – Raw Files
      • Bridges the continuum between stream and batch systems
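One plausible reading of "Periodic merge" is that entries from the log store are folded into the snapshot store so that the snapshot keeps only the newest version of each row. The sketch below shows that idea only; the slides do not describe the actual merge procedure, and `BootstrapMergeSketch`, `LogEntry`, `Row`, and `merge` are hypothetical names.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of folding a change log into a snapshot. */
final class BootstrapMergeSketch {

    /** One entry in the log store. */
    static final class LogEntry {
        final long scn;
        final String key;
        final String value;

        LogEntry(long scn, String key, String value) {
            this.scn = scn;
            this.key = key;
            this.value = value;
        }
    }

    /** One row in the snapshot store, tagged with the SCN that last wrote it. */
    static final class Row {
        final long scn;
        final String value;

        Row(long scn, String value) {
            this.scn = scn;
            this.value = value;
        }
    }

    /**
     * Applies log entries newer than snapshotScn to the snapshot and returns
     * the SCN the snapshot is consistent with afterwards.
     */
    static long merge(Map<String, Row> snapshot, List<LogEntry> log, long snapshotScn) {
        long highest = snapshotScn;
        for (LogEntry e : log) {
            if (e.scn <= snapshotScn) {
                continue; // already reflected in the snapshot
            }
            Row current = snapshot.get(e.key);
            if (current == null || current.scn <= e.scn) {
                snapshot.put(e.key, new Row(e.scn, e.value));
            }
            highest = Math.max(highest, e.scn);
        }
        return highest;
    }
}
```

A consumer doing a full bootstrap would then read the snapshot and replay only the log entries after the returned SCN.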
  • 17. The Consumer Client Library
      • Glue between Databus infra and business logic in the consumer
      • Isolates the consumer from changes in the Databus layer
      • Switches between relay and bootstrap as needed
      • API
          – Callback with transactions
          – Iterators over windows
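The "switches between relay and bootstrap as needed" behaviour can be sketched as a decision on the consumer's checkpoint. This is an assumption-laden illustration, not the client library's logic: `SourceSelectionSketch`, `choose`, and the idea that each tier advertises the oldest SCN it still holds are all hypothetical.

```java
/** Hypothetical sketch of choosing where a consumer should pull from. */
final class SourceSelectionSketch {

    enum Source { RELAY, BOOTSTRAP_CATCH_UP, BOOTSTRAP_SNAPSHOT }

    /**
     * @param checkpointScn  highest SCN the consumer has applied (-1 if none)
     * @param relayMinScn    oldest SCN still held in the relay buffer
     * @param logStoreMinScn oldest SCN still held in the bootstrap log store
     */
    static Source choose(long checkpointScn, long relayMinScn, long logStoreMinScn) {
        long next = checkpointScn + 1; // first SCN the consumer still needs
        if (next >= relayMinScn) {
            return Source.RELAY; // online case: served from memory
        }
        if (next >= logStoreMinScn) {
            return Source.BOOTSTRAP_CATCH_UP; // replay the missing log range
        }
        return Source.BOOTSTRAP_SNAPSHOT; // too far behind or brand new: full bootstrap
    }
}
```

With illustrative values, a consumer at SCN 99999 stays on a relay whose buffer starts at 70000, one at 50000 catches up from a log store that starts at 30000, and a new consumer with no checkpoint takes the snapshot path.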
  • 18. Fetcher Implementations
      • Oracle
          – Trigger-based
      • MySQL
          – Custom-storage-engine based
      • In Labs
          – Alternative implementations for Oracle
          – OpenReplicator integration for MySQL
  • 19. Meta-data Management
      • Event definition, serialization and transport
          – Avro
      • Oracle, MySQL
          – Avro definition generated from the table schema
      • Schema evolution
          – Only backwards-compatible changes allowed
      • Isolation between upgrades on producer and consumer
  • 20. Scaling the consumers (Partitioning)
      • Server-side filtering
          – Range, mod, hash
          – Allows client to control partitioning function
      • Consumer groups
          – Distribute partitions evenly across a group
          – Move partitions to available consumers on failure
          – Minimize re-processing
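The partitioning bullets on slide 20 can be sketched as a mod-based partition function plus a simple group assignment that moves only a failed consumer's partitions. This is a hypothetical sketch: `PartitionSketch` and its methods are not Databus APIs, and the round-robin and least-loaded policies are assumptions chosen for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of mod partitioning and consumer-group assignment. */
final class PartitionSketch {

    /** Mod filter: maps a numeric key to a partition in [0, numPartitions). */
    static int modPartition(long key, int numPartitions) {
        return (int) Math.floorMod(key, (long) numPartitions);
    }

    /** Spreads partitions round-robin across the consumers in the group. */
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(consumers.get(p % consumers.size())).add(p);
        }
        return assignment;
    }

    /**
     * Moves only the failed consumer's partitions, each to the survivor that
     * currently owns the fewest, so other consumers keep their progress.
     */
    static void handleFailure(Map<String, List<Integer>> assignment, String failed) {
        if (!assignment.containsKey(failed) || assignment.size() < 2) {
            return; // unknown consumer, or no survivor to take over
        }
        List<Integer> orphaned = assignment.remove(failed);
        for (int partition : orphaned) {
            String target = null;
            for (Map.Entry<String, List<Integer>> e : assignment.entrySet()) {
                if (target == null || e.getValue().size() < assignment.get(target).size()) {
                    target = e.getKey();
                }
            }
            assignment.get(target).add(partition);
        }
    }
}
```

Leaving the surviving consumers' existing partitions untouched is what keeps re-processing to a minimum: only the orphaned partitions have to be resumed from their last checkpoint.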
  • 22. Development with Databus – Client Library
      [Diagram: Databus client consumers implement the Stream Event Callback API and the Bootstrap Event Callback API, e.g. onDataEvent(DbusEvent, Decoder), …; the Databus Client Library exposes the Client API: register(consumers, sources, filter), …, start(), …, shutdown()]
  • 23. Databus Consumer Implementation

      class MyConsumer extends AbstractDatabusStreamConsumer {
        @Override
        public ConsumerCallbackResult onDataEvent(DbusEvent e, DbusEventDecoder d) {
          // use the map-like Avro GenericRecord
          GenericRecord g = d.getGenericRecord(e, null);

          // or use the auto-generated Java class
          // (named "event" so it does not clash with the parameter "e")
          MyEvent event = d.getTypedValue(e, null, MyEvent.class);
          …
          return ConsumerCallbackResult.SUCCESS;
        }
      }
  • 24. Starting the client

      public static void main(String[] args) throws Exception {
        // configure
        DatabusHttpClientImpl.Config clientConfig = new DatabusHttpClientImpl.Config();
        clientConfig.loadFromFile("mydbus", "mdbus.props");
        DatabusHttpClientImpl client = new DatabusHttpClientImpl(clientConfig);

        // register callback
        MyConsumer callback = new MyConsumer();
        client.registerDatabusStreamListener(callback, null,
            "com.linkedin.events.member2.MemberProfile");

        // start client library
        client.startAndBlock();
      }
  • 25. Event Callback APIs
  • 27. Relay Throughput
  • 28. Consumer Throughput
  • 29. End-End Latency
  • 30. Snapshot vs Catchup