More Than Websites



And The Firehose
Saturday, 23 March 13
Introduce Yourselves
@stuherbert
What is DataSift




Sift through social data
Twitter firehose, Facebook, bitly clicks, news, videos, comments and more



Gain insights using augmentations
Language, gender, trends, links, sentiment, salience & entity analysis and more



Realtime
Get matching data within seconds of it being posted




Historics
Search our social data archive going back to January 2010




Pull the data
from our servers via HTTP/1.1 streaming or websockets
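A minimal sketch of what the client side of a pulled stream can look like. The slide only says the data arrives over HTTP/1.1 streaming or websockets; the wire format used here (one JSON object per line, with blank keep-alive lines) is an assumption made for illustration, and `iter_json_lines` is a hypothetical helper, not part of any DataSift client library. The point it shows is that network chunks do not line up with message boundaries, so the client has to buffer until a full line has arrived.

```python
import json
from typing import Iterable, Iterator


def iter_json_lines(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one decoded JSON object per complete line in a chunked byte stream.

    Assumes newline-delimited JSON; this framing is an illustration,
    not a documented DataSift format.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A chunk may hold several lines, or only part of one.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if line:  # skip blank keep-alive lines
                yield json.loads(line)
    # Anything left in the buffer is an incomplete message and is dropped.


# Simulated network chunks: message boundaries fall in the middle of chunks.
chunks = [
    b'{"id": 1, "text": "hel',
    b'lo"}\n\n{"id": 2, ',
    b'"text": "world"}\n{"id": 3',
]
messages = list(iter_json_lines(chunks))
print(messages)
# [{'id': 1, 'text': 'hello'}, {'id': 2, 'text': 'world'}]
```

In a real client the `chunks` iterable would come from the HTTP response body or the websocket connection; the buffering logic stays the same either way.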




Let us push data to you
Have the data delivered directly to your servers or into your databases



DataSift in numbers




30
Sources of social data and data augmentations




Up to 20,000
Number of new pieces of data ingested into DataSift every second




3 Terabytes
Amount of new data added to the Historics archive every week
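To put the ingest and archive figures on a common scale, here is a small back-of-the-envelope calculation. It uses only the two numbers quoted on the slides ("up to 20,000" items per second and 3 terabytes per week). Treating 20,000 per second as a sustained rate and reading "terabyte" as 10^12 bytes are both assumptions, so the results are rough upper and average bounds rather than measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY

peak_items_per_second = 20_000        # "Up to 20,000" from the slide
archive_bytes_per_week = 3 * 10**12   # "3 Terabytes", assuming decimal terabytes

# Upper bound: what a full day at the quoted peak rate would add up to.
peak_items_per_day = peak_items_per_second * SECONDS_PER_DAY

# Average write rate needed to grow the archive by 3 TB every week.
archive_bytes_per_second = archive_bytes_per_week / SECONDS_PER_WEEK

print(f"{peak_items_per_day:,} items/day at sustained peak")
print(f"{archive_bytes_per_second / 10**6:.2f} MB/s average archive growth")
# 1,728,000,000 items/day at sustained peak
# 4.96 MB/s average archive growth
```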




12
Different ways we can deliver data to you




1
Average number of seconds to pass the data through DataSift




12
Number of services data passes through inside DataSift




25
Number of engineers who write code for the DataSift platform




5
Primary programming languages: C++, Node, PHP, Python, Scala




154
Private GitHub repos




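[Bar chart: number of repositories per language, horizontal axis from 0 to 60. Languages listed top to bottom: PHP, Java & Scala, C & C++, JS & Node, Unclassified, Python, Shell Script, Ruby, C#, VimL.]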




Our GitHub Repositories
Architecture




Three major data pipelines
+ supporting services




Data Archiving
Adds new data to the Historics Archive




Filtering Pipeline
Filtering and delivery of data in realtime




Playback Pipeline
Filtering and delivery of data from the Historics Archive




DataSift Architecture 2.2
Diagram credit: @lorenzoalberton, @datasift

[Architecture diagram. Labels recoverable from the figure, grouped by the region of the diagram they appear in:]

Input Streams (HttpStreaming, PuSH, Search): Twitter, Bit.ly, Facebook, Wikipedia, Reddit, LexisNexis, Meltwater, Estimize, Digg, NewsCred, BoardReader, MySpace, SuperFeeder
Data ingestion + Augmentation: Goblin Head, Goblin Tail, Msg splitter, Stream Splitter/Joiner, Deduper, Redis, Deletes Processor, Ogre, Interaction Generation
Augmentation Pipeline: Language Detection, Demographics, Trends Analysis, Sentiment Analysis, Topics Analysis, Klout Score + Profile, Named Entities, Links Resolution + OpenGraph + Twitter Cards + Metadata, 3rd party APIs
Archiving: Kafka, Ultrahose, Archiver, Stream Recorder, HBase Cluster (Region 1 ... Region N), HDFS, Hadoop Data Nodes, Map/Reduce
Historical Queries: Titan Historics, jobs DB, chunks DB, chunk selector, job tracker, Interaction Targets Mapping, Filtering, Tardis, Pickle, Stream results
Filtering Engine: Prism, Control Channels, Node Shards, Pickle Nodes, push
Time Machine + Insights: Post-Processing, Stream Analytics, Exports and Analytics
API and WEB: Historics Scheduler, Recording Scheduler, Push Scheduler, PickleDB, CSDL Compiler, Validator, Normaliser, Definition Manager, Stream Manager, Mask Manager, Authentication Manager, Billing Pipeline, each with its DB
Accounting and monitoring: ACL (with interaction counter), EDRs (licensed content metrics), Monitoring, Kafka Queue, tracker, Connection Manager, Connections
Delivery: (D5) Hardware Load Balancer, Meteor, Node, Real-time Streams, WebSockets, HTTPStreaming, Buffered Streams, Redis, Snapshotter, Worker, HTTP Request (GET batch), PUSH Producer, PUSH Scheduler, Delivery Subscriptions job queue
                                                                                                                                                                                       Storage                                                                      Subscriptions
                                                                                                                                                                                                                                                                                                                                HTTP(S) POST
                                                                                                                                           Events                                                                                                                       DB                                                               (S)FTP
                           Notification                                License                                                              Storage
                            Service                                   Manager          DB                                                                                                                                                                                                                                               Amazon S3
                                                                                                                                                                                                                                                                                                                                                                                                Cloud Storage
                                                                                                                                                                                                                                                                                                                                         DynamoDB




                                                                                                                                                                                                                                                                                       kafka-consumer
                                                                                                                                                                                                                                                      subscription X                                                                    Microsoft Azure                                                DBs
                                                                      Limit                                                            Monitoring                                          Audit
                                                                                       DB                                                                                                                                                                                                                                                   MongoDB
                                                                      Manager                                                          Aggregator                                                                                                                                                                                                                                                    BI tools
                                                                                                                                                                                                                                                      subscription Y                                                                          Oracle
                                                                                                                                                                                                                                                                                                                             PUSH              CouchDB
                                                                            Stop                                                                                                                                                                                                                                            Delivery
                                                                                                                                                                                                                                                                                                                                               IBM Cognos
                                                                            PUB
                                                                                                                                                                                                                                                                                                                                              Google BigQuery




 DataSift Technical Architecture                                                                                                                                                                                                                                                                                                                                   @
Saturday, 23 March 13
DataSift Architecture 2.2
@lorenzoalberton

[Architecture diagram. Labels recoverable from this part of the figure, grouped by position:]

  Data ingestion + Augmentation
    Input Streams (HttpStreaming, PuSH, Search): Twitter, Bit.ly, Facebook, Wikipedia, Reddit, LexisNexis, Meltwater, Estimize, Digg, NewsCred, BoardReader, MySpace, SuperFeeder
    Goblin Head, Goblin Tail, Msg splitter, Stream Splitter/Joiner, Deduper, Redis, Deletes, Processor
    Interaction Generation (Ogre workers)
    Augmentation Pipeline: Language Detection, Demographics, Trends Analysis, Sentiment Analysis, Topics Analysis, Klout Score + Profile, Named Entities, Links Resolution + OpenGraph + Twitter Cards + Metadata, 3rd party APIs
    Kafka, Prism, Control Channels

  Storage and Historics
    HBase Cluster: Ultrahose, Archiver, Region 1, Region 2, ..., Region N
    HDFS / Hadoop: Data Nodes, Archiver, Interaction Targets Mapping, Filtering, Tardis, Pickle
    Stream Recorder, Map/Reduce, Historical Queries
    Titan Historics: jobs DB, chunks DB, chunk selector, job tracker, Stream results
    Time Machine + Insights: Post-Processing, Stream Analytics
    Exports and Analytics

  Filtering Engine
    Node Shards made up of Pickle Nodes, with push between Pickle nodes
    Historics Scheduler, Recording Scheduler, Push Scheduler
    PickleDB, CSDL Compiler, Validator, Normaliser

  Real-time delivery
    (D5) Hardware Load Balancer, Meteor, Real-time Streams, Nodes, WebSockets, HTTPStreaming

  Platform services
    API, WEB, Definition Manager (DB), Stream Manager (DB), Mask Manager (DB)
    EDRs (licensed content metrics), ACL (with interaction counter)
    Snapshotter, Worker, HTTP Request

  @datasift
                                                                                                                                                                                                                                       Buffered Redis
                                                                                                                                                                                                                                       Streams                                       GET batch
                                                                                                                                                                                                                                                          Worker
                                                                                                                                                                                                                                                                                                                               Delivery Subscriptions
                                                                                                                                           Monitoring
                                                                                                                                            Kafka                                    Connection
                                                                                                                                            Queue                                     Manager
                                                                      Authentication                                                                                                                                                       PUSH                                                    PUSH                          job queue
                                                                                  DB
                                                                      Manager                                                                                                                                                             Producer                                                Scheduler
                                                                                                                                            tracker
                                                                      Billing
                                                                      Pipeline         DB                                                                                            Connections
                                                                                                                                                                                       Storage                                                                      Subscriptions
                                                                                                                                                                                                                                                                                                                                HTTP(S) POST
                                                                                                                                           Events                                                                                                                       DB                                                               (S)FTP
                           Notification                                License                                                              Storage
                            Service                                   Manager          DB                                                                                                                                                                                                                                               Amazon S3
                                                                                                                                                                                                                                                                                                                                                                                                Cloud Storage
                                                                                                                                                                                                                                                                                                                                         DynamoDB




                                                                                                                                                                                                                                                                                       kafka-consumer
                                                                                                                                                                                                                                                                                       kafka-consumer
                                                                                                                                                                                                                                                      subscription X                                                                    Microsoft Azure                                                DBs
                                                                      Limit                                                            Monitoring                                          Audit
                                                                                       DB                                                                                                                                                                                                                                                   MongoDB
                                                                      Manager                                                          Aggregator                                                                                                                                                                                                                                                    BI tools
                                                                                                                                                                                                                                                      subscription Y                                                                          Oracle
                                                                                                                                                                                                                                                                                                                             PUSH              CouchDB
                                                                            Stop                                                                                                                                                                                                                                            Delivery
                                                                                                                                                                                                                                                                                                                                               IBM Cognos
                                                                            PUB
                                                                                                                                                                                                                                                                                                                                              Google BigQuery




 Filtering Pipeline                                                                                                                                                                                                                                                                                                                                                @
Saturday, 23 March 13

More Than Websites: PHP And The Firehose @DataSift (2013)

  • 1. More Than Websites And The Firehose @ Saturday, 23 March 13
  • 2. Introduce Yourselves
  • 3. @stuherbert
  • 4. What is
  • 5. Sift through social data Twitter firehose, Facebook, bitly clicks, news, videos, comments and more
  • 6. Gain insights using augmentations Language, gender, trends, links, sentiment, salience & entity analysis and more
  • 7. Realtime Get matching data within seconds of it being posted
  • 8. Historics Search our social data archive going back to January 2010
  • 9. Pull the data from our servers via HTTP/1.1 streaming or websockets
  • 10. Let us push data to you Have the data delivered directly to your servers or into your databases
  • 11. in numbers
  • 12. 30 Sources of social data and data augmentations
  • 13. Up to 20,000 Number of new pieces of data ingested into DataSift every second
  • 14. 3 Terabytes Amount of new data added to the Historics archive every week
  • 15. 12 Different ways we can deliver data to you
  • 16. 1 Average number of seconds to pass the data through DataSift
  • 17. 12 Number of services data passes through inside DataSift
  • 18. 25 Number of engineers who write code for the DataSift platform
  • 19. 5 Primary programming languages: C++, Node, PHP, Python, Scala
  • 20. 154 Private GitHub repos
  • 21. Our GitHub Repositories (bar chart of repository counts by language: PHP, Java & Scala, C & C++, JS & Node, Unclassified, Python, Shell Script, Ruby, C#, VimL; axis from 0 to 60)
  • 22. Architecture
  • 23. Three major data pipelines + supporting services
  • 24. Data Archiving Adds new data to the Historics Archive
  • 25. Filtering Pipeline Filtering and delivery of data in realtime
  • 26. Playback Pipeline Filtering and delivery of data from the Historics Archive
  • 27. DataSift Technical Architecture (diagram titled "DataSift Architecture 2.2", labelled @lorenzoalberton and @datasift). Labelled areas include: Data ingestion + Augmentation, Augmentation Pipeline, HBase Cluster, Historics, Filtering Engine, CSDL Compiler, Real-time Streams (WebSockets, HTTP Streaming), Buffered Streams, PUSH Delivery, and supporting manager services.
  • 28. Filtering Pipeline [diagram: "DataSift Architecture 2.2" by @lorenzoalberton, shown under the slide title "Filtering Pipeline"]