Big Data @ Netflix

Using Big Data to Grow our Business & Retain our Customers.

Jerome Boulon
Lead Architect, Hadoop Big Data Infrastructure
jboulon@netflix.com

February 15, 2012
Big Data @ Netflix
Offline analysis:
•  Honu: Scalable log analysis system to gain business
   insights:
   –  Error logs (unstructured logs)
   –  Statistical logs & Performance logs
   –  Etc

Online analysis:
•  Cassandra for all online activities and user facing
   data
   –  A/B testing (test allocation, metadata)
   –  Service level Configuration
   –  etc
Overview
[Diagram: data collection pipeline (Application, Collectors) and data processing pipeline (M/R, Hive)]
Honu - Structured Log API
Using Annotations
•  Convert Java Class to Hive Table dynamically
•  Add/Remove column
•  Supported java types:
   –  All primitives
   –  Map
   –  Object using the toString method

Using the Key/Value API
•  Produce the same result as Annotation
•  Avoid unnecessary object creation
•  Fully dynamic
•  Thread Safe
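
The two styles can be sketched roughly as follows. This is an illustration in Python, not Honu's actual Java API: a dataclass stands in for the annotated Java class, and `HIVE_TYPES`, `hive_columns`, `to_row`, `KeyValueEvent`, and `PlayEvent` are hypothetical names. Primitive fields map to a column type, any other object is stored through its string form (the toString rule above), and Map handling is omitted.

```python
from dataclasses import dataclass, fields

# Hypothetical mapping from Python types to Hive column types.
HIVE_TYPES = {int: "BIGINT", float: "DOUBLE", bool: "BOOLEAN", str: "STRING"}


def hive_columns(cls):
    """Derive (column, hive_type) pairs from a dataclass definition."""
    return [(f.name, HIVE_TYPES.get(f.type, "STRING")) for f in fields(cls)]


def to_row(event):
    """Flatten an event object into a column -> value dict."""
    row = {}
    for f in fields(event):
        value = getattr(event, f.name)
        # Anything that is not a primitive is stored via its string form.
        row[f.name] = value if type(value) in HIVE_TYPES else str(value)
    return row


class KeyValueEvent:
    """Key/value style: builds the same row without an event object."""

    def __init__(self):
        self._row = {}

    def put(self, key, value):
        self._row[key] = value if type(value) in HIVE_TYPES else str(value)
        return self

    def to_row(self):
        return dict(self._row)


@dataclass
class PlayEvent:
    movieId: int
    customerId: int
    hostname: str


annotated = to_row(PlayEvent(movieId=42, customerId=7, hostname="host-1"))
key_value = (KeyValueEvent().put("movieId", 42).put("customerId", 7)
             .put("hostname", "host-1").to_row())
```

Both paths end in the same row, which is the point of the second column of the slide: the key/value style trades the typed class for fewer allocations and a schema that can change at runtime.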
Honu, What you get:

log.logEvent(myObject)

Hive table: movieId customerId timestamp hostname

Select customerId, count(1) from MyTable group by customerId;
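
To make the flow concrete, here is a small sketch that uses an in-memory SQLite database as a stand-in for Hive. The table and column names come from the slide; `log_event`, the sample events, and the added ORDER BY (only there to make the output order stable) are assumptions for illustration.

```python
import sqlite3

# SQLite stands in for Hive here; the column names come from the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (movieId INTEGER, customerId INTEGER, "
    "timestamp INTEGER, hostname TEXT)"
)


def log_event(event):
    """Hypothetical stand-in for log.logEvent(myObject): one event, one row."""
    conn.execute(
        "INSERT INTO MyTable VALUES "
        "(:movieId, :customerId, :timestamp, :hostname)",
        event,
    )


# Made-up sample events, for illustration only.
for movie_id, customer_id in [(1, 100), (2, 100), (1, 200)]:
    log_event({"movieId": movie_id, "customerId": customer_id,
               "timestamp": 1329264000, "hostname": "host-1"})

counts = conn.execute(
    "SELECT customerId, count(1) FROM MyTable "
    "GROUP BY customerId ORDER BY customerId"
).fetchall()
print(counts)  # [(100, 2), (200, 1)]
```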
December 2009
–  POC for Streaming analysis
–  Single AWS zone
–  1 application
–  60 Million events/Day
–  50 clients
–  Small Hadoop cluster
–  1 Map/Reduce
–  1 Table
[Diagram: Application, Collectors, M/R, Oracle]
Feb 2012
- 40+ Billion events/Day
- 8+ tables with 1+TB/Day
- 100+ smaller tables

Self-serve:
→ No DBA
→ No Pre-provisioning
→ Fully integrated with Hive

- Multi-region deployments
- Transparent to our engineers
- Streaming based solution
- Zero configuration
- 7000+ clients
- Built-in:
  - Fail-Over
  - Load balancing

Netflix Hive warehouse
→ One central Data warehouse
→ Hourly/Daily reports
→ Data retention/expiration
Traceability & Performance analysis
•  Track service level call
   –  Instrument low level HTTP client
   –  Call graph
   –  Request processing vs Perceived latency
   –  Payload marshalling/unmarshalling
      - duration, size, etc
   –  Service Result
      - Status, Error code, Exception, etc
Diagnostic Information
•  Collect latency information for all external
   operations
•  If Latency > threshold log to Honu:
    –  AWS Region & Zone
    –  Instance
    –  Service details
•  Open Jira/Ticket & Attach diagnostic info
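
The threshold rule above can be sketched as a timing wrapper around each external operation. Everything named here is an assumption for illustration: the 200 ms threshold, the `track_latency` helper, the `diagnostic_events` list standing in for the Honu sink, and the region, zone, and instance values. A fake clock keeps the example deterministic.

```python
import time
from contextlib import contextmanager

LATENCY_THRESHOLD_MS = 200.0  # assumed value; the slides give no number
diagnostic_events = []        # stand-in for the Honu log sink


@contextmanager
def track_latency(service, region, zone, instance, clock=time.monotonic):
    """Time an external operation; record diagnostics only when it is slow."""
    start = clock()
    try:
        yield
    finally:
        elapsed_ms = (clock() - start) * 1000.0
        if elapsed_ms > LATENCY_THRESHOLD_MS:
            diagnostic_events.append({
                "region": region,
                "zone": zone,
                "instance": instance,
                "service": service,
                "latency_ms": elapsed_ms,
            })


# Fake clock readings in seconds: one fast call, then one slow call.
ticks = iter([0.0, 0.05, 10.0, 10.5])


def fake_clock():
    return next(ticks)


with track_latency("subscriber", "us-east-1", "us-east-1a", "i-example",
                   clock=fake_clock):
    pass  # 50 ms: below the threshold, nothing is recorded
with track_latency("subscriber", "us-east-1", "us-east-1a", "i-example",
                   clock=fake_clock):
    pass  # 500 ms: above the threshold, one event is recorded
```

Logging only the slow calls keeps the diagnostic volume small while still capturing the context needed to attach to a ticket.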
Mix Offline and Online Data
Offline data
- Fire & forget
- Scale to very large volumes
- Cost effective

Specific conditions
- Online Data availability is not mandatory
- If it exists, the data could be useful online
- Only a subset is useful Online
- Ready to pay a little bit more

Special collectors
- All data goes to Hive
- A subset goes to a real-time system
- Still cost effective

Customer support
- Browsing history
- Historical & non-critical actions

Debug
- Push validation
- Root cause analysis
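
A rough sketch of the special-collector idea, under stated assumptions: every event goes to the offline path, and only selected event types are also forwarded to a real-time sink whose failures are tolerated. `SpecialCollector`, the sink callables, and the event types are hypothetical names, not part of Honu.

```python
class SpecialCollector:
    """Sends every event to the offline sink and a subset to a real-time sink."""

    def __init__(self, offline_sink, realtime_sink, realtime_types):
        self.offline_sink = offline_sink
        self.realtime_sink = realtime_sink
        self.realtime_types = frozenset(realtime_types)

    def collect(self, event):
        self.offline_sink(event)  # all data goes to the offline path
        if event["type"] in self.realtime_types:
            try:
                self.realtime_sink(event)
            except Exception:
                # Online availability is not mandatory: the offline copy
                # already exists, so a real-time failure is tolerated.
                pass


hive_rows, realtime_rows = [], []
collector = SpecialCollector(hive_rows.append, realtime_rows.append,
                             realtime_types={"playback_error"})
collector.collect({"type": "playback_error", "customerId": 7})
collector.collect({"type": "page_view", "customerId": 7})
```

Writing to the offline sink first means the warehouse stays complete even when the real-time system is unavailable.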
Honu Realtime usages
•  Movie playback experience
   –  Video quality
   –  Network issue
•  Customer Support
   –  Historical usage
   –  Last activity
•  Errors Summary
   –  Error tracking per service
   –  Error tracking per device
•  Launch Reports
   –  Push validation
   –  Root cause analysis
Honu Realtime - Architecture
[Diagram: realtime data collection pipeline (Application, Collectors, Realtime System, Realtime Access, M/R)]
A/B Testing
 Test: An experiment where several
 competing behaviors are
 implemented and compared.

 Cell: different experiences within a
 test that are being compared against
 each other.

 Allocation: a customer-specific
 assignment to a cell within a test

Online data:
- Cell Allocation > 1 Billion records
- Test config: 1 entry/test/customer

Tracking information (example):
  1 M customers per Test
  8 tracking events per Day
  ------------------------------------
  100 Tests = 800 M events/Day
  3 Months = 72 B events
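
The tracking example works out as follows. The only assumption is that "3 Months" means 90 days, which is what makes the slide's total come out to 72 B.

```python
customers_per_test = 1_000_000
events_per_customer_per_day = 8
tests = 100
days = 90  # "3 Months", taken as 90 days to match the slide's total

events_per_day = customers_per_test * events_per_customer_per_day * tests
events_total = events_per_day * days

print(f"{events_per_day:,} events/day")           # 800,000,000 events/day
print(f"{events_total:,} events in {days} days")  # 72,000,000,000 events in 90 days
```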
Movie Presentation A/B Test
A/B Testing - Architecture
Online Data
- Customer test allocation
- Metadata about the test
Ex:
- Start/End date
- UI directives
- Logging directives

Offline Data
- Test tracking
Ex:
- Retention
- Engagement metrics
Beacon Server

User behavior
- Client side interactions
- Search/Play/Stop/Pause

Device monitoring
- Heartbeat
- Status & Key metrics

[Diagram: Ajax calls, Beacon servers]
BI Integration
Three main technologies

•  Teradata (Data center)
•  Hive (Cloud)
•  Cassandra (Cloud)
Hive ↔ BI
–  Dimension tables (daily export from Teradata)
–  Hourly/Daily Hive summary queries
–  Hourly/Daily export from Hive to BI
  •  Queries run in the cloud
  •  Aggregated result goes back to our BI solution
Hive Reports
Cassandra → BI

•  Use Cassandra backups to run analytics
•  Export SSTable to Hadoop
•  Pig to:
  –  Parse SSTable
  –  Extract/Group required information
•  Load the result back to Teradata
jboulon@gmail.com
www.linkedin.com/in/jboulon

More Related Content

What's hot

How Disney+ uses fast data ubiquity to improve the customer experience
 How Disney+ uses fast data ubiquity to improve the customer experience  How Disney+ uses fast data ubiquity to improve the customer experience
How Disney+ uses fast data ubiquity to improve the customer experience Martin Zapletal
 
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Spark Summit
 
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...HostedbyConfluent
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Real Time Data Infrastructure team overview
Real Time Data Infrastructure team overviewReal Time Data Infrastructure team overview
Real Time Data Infrastructure team overviewMonal Daxini
 
Druid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiDruid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiBrian Olsen
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...Amazon Web Services
 
The Netflix data platform: Now and in the future by Kurt Brown
The Netflix data platform: Now and in the future by Kurt BrownThe Netflix data platform: Now and in the future by Kurt Brown
The Netflix data platform: Now and in the future by Kurt BrownData Con LA
 
Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Timothy Spann
 
Netflix incloudsmarch8 2011forwiki
Netflix incloudsmarch8 2011forwikiNetflix incloudsmarch8 2011forwiki
Netflix incloudsmarch8 2011forwikiKevin McEntee
 
Getting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesGetting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesSingleStore
 
Processing Real-Time Data at Scale: A streaming platform as a central nervous...
Processing Real-Time Data at Scale: A streaming platform as a central nervous...Processing Real-Time Data at Scale: A streaming platform as a central nervous...
Processing Real-Time Data at Scale: A streaming platform as a central nervous...confluent
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at AirbnbHao Wang
 
A unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
A unified analytics platform with Kafka and Flink | Stephan Ewen, VervericaA unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
A unified analytics platform with Kafka and Flink | Stephan Ewen, VervericaHostedbyConfluent
 
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...HostedbyConfluent
 
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...HostedbyConfluent
 
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data TechBig Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data TechHostedbyConfluent
 

What's hot (20)

ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
How Disney+ uses fast data ubiquity to improve the customer experience
 How Disney+ uses fast data ubiquity to improve the customer experience  How Disney+ uses fast data ubiquity to improve the customer experience
How Disney+ uses fast data ubiquity to improve the customer experience
 
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
 
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Real Time Data Infrastructure team overview
Real Time Data Infrastructure team overviewReal Time Data Infrastructure team overview
Real Time Data Infrastructure team overview
 
Instrumenting your Instruments
Instrumenting your Instruments Instrumenting your Instruments
Instrumenting your Instruments
 
Druid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiDruid Overview by Rachel Pedreschi
Druid Overview by Rachel Pedreschi
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
 
The Netflix data platform: Now and in the future by Kurt Brown
The Netflix data platform: Now and in the future by Kurt BrownThe Netflix data platform: Now and in the future by Kurt Brown
The Netflix data platform: Now and in the future by Kurt Brown
 
Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020
 
Netflix incloudsmarch8 2011forwiki
Netflix incloudsmarch8 2011forwikiNetflix incloudsmarch8 2011forwiki
Netflix incloudsmarch8 2011forwiki
 
Getting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesGetting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming Architectures
 
Processing Real-Time Data at Scale: A streaming platform as a central nervous...
Processing Real-Time Data at Scale: A streaming platform as a central nervous...Processing Real-Time Data at Scale: A streaming platform as a central nervous...
Processing Real-Time Data at Scale: A streaming platform as a central nervous...
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at Airbnb
 
A unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
A unified analytics platform with Kafka and Flink | Stephan Ewen, VervericaA unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
A unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
 
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
 
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
 
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data TechBig Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
 

Viewers also liked

The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsMonal Daxini
 
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original ProgrammingThe Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programminghye-jin-lee
 
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILESACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILESAdrija Chowdhury
 
Presto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScalePresto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScaleDataWorks Summit
 
Mobile Phone Based Drunk driving detection
Mobile Phone Based Drunk driving detectionMobile Phone Based Drunk driving detection
Mobile Phone Based Drunk driving detectionnagarajc007
 
Netflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsNetflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsBlake Irvine
 

Viewers also liked (6)

The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original ProgrammingThe Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
 
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILESACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
 
Presto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScalePresto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte Scale
 
Mobile Phone Based Drunk driving detection
Mobile Phone Based Drunk driving detectionMobile Phone Based Drunk driving detection
Mobile Phone Based Drunk driving detection
 
Netflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsNetflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of Analytics
 

Similar to Cloud Connect 2012, Big Data @ Netflix

Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Spark Summit
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidTony Ng
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantageAmazon Web Services
 
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...SL Corporation
 
Machine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupMachine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupSri Ambati
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Datacwensel
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaKai Wähner
 
Resistance is futile, resilience is crucial
Resistance is futile, resilience is crucialResistance is futile, resilience is crucial
Resistance is futile, resilience is crucialHristo Iliev
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...Amazon Web Services
 
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...SL Corporation
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time AnalyticsAmazon Web Services
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networkspbelko82
 
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroDevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroGaurav "GP" Pal
 
Combining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data AnalyticsCombining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data AnalyticsDataWorks Summit
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013Michael Hiskey
 

Similar to Cloud Connect 2012, Big Data @ Netflix (20)

Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantage
 
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
 
Machine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupMachine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville Meetup
 
Big Data Introduction
Big Data IntroductionBig Data Introduction
Big Data Introduction
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
 
Resistance is futile, resilience is crucial
Resistance is futile, resilience is crucialResistance is futile, resilience is crucial
Resistance is futile, resilience is crucial
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
 
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time Analytics
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroDevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
 
Combining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data AnalyticsCombining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data Analytics
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013
 

Recently uploaded

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024

Cloud Connect 2012, Big Data @ Netflix

  • 1. Big Data @ Netflix: Using Big Data to Grow our Business & Retain our Customers.
      Jerome Boulon, Lead Architect, Hadoop Big Data Infrastructure
      February 15, 2012
      jboulon@netflix.com
  • 2. Big Data @ Netflix
      Offline analysis:
      • Honu: scalable log analysis system to gain business insights:
          - Error logs (unstructured logs)
          - Statistical logs & performance logs
          - Etc.
      Online analysis:
      • Cassandra for all online activities and user-facing data:
          - A/B testing (test allocation, metadata)
          - Service-level configuration
          - Etc.
  • 3. Overview
      [Diagram labels: Data collection pipeline (Application, Collectors); Data processing pipeline (Hive, M/R)]
  • 4. Honu - Structured Log API
      Using Annotations:
      • Convert Java Class to Hive Table dynamically
      • Add/Remove column
      • Supported Java types:
          - All primitives
          - Map
          - Object using the toString method
      Using the Key/Value API:
      • Produce the same result as Annotation
      • Avoid unnecessary object creation
      • Fully dynamic
      • Thread safe
  • 5. Honu, What you get:
      log.logEvent(myObject)
      → Hive table with columns: movieId, customerId, timestamp, hostname
      Select customerId, count(1) from MyTable group by customerId;
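To make the slide 5 query concrete, the following is a minimal Python sketch of what `Select customerId, count(1) ... group by customerId` computes over logged events. The sample rows and the `count_events_per_customer` helper are illustrative assumptions; they are not Honu code or Netflix data.

```python
from collections import Counter

# Hypothetical sample rows using the four columns shown on the slide.
events = [
    {"movieId": 1, "customerId": "a", "timestamp": 1000, "hostname": "h1"},
    {"movieId": 2, "customerId": "a", "timestamp": 1005, "hostname": "h2"},
    {"movieId": 1, "customerId": "b", "timestamp": 1010, "hostname": "h1"},
]


def count_events_per_customer(rows):
    """In-memory equivalent of:
    SELECT customerId, count(1) FROM MyTable GROUP BY customerId;
    """
    return Counter(row["customerId"] for row in rows)


counts = count_events_per_customer(events)
for customer_id, n in sorted(counts.items()):
    print(customer_id, n)
```

In Hive the same aggregation runs as a Map/Reduce job over the table, so it is not limited by the memory of a single process the way this sketch is.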
  • 6. December 2009
      - POC for streaming analysis
      - Single AWS zone
      - 1 application
      - 60 million events/day
      - 50 clients
      - Small Hadoop cluster
      - 1 Map/Reduce
      - 1 table
      [Diagram labels: Application, Collectors, M/R, Oracle]
  • 7. Feb 2012
      • 40+ billion events/day
      • 8+ tables with 1+ TB/day
      • 100+ smaller tables
      • Self-serve: → No DBA → No pre-provisioning → Fully integrated with Hive
      • Multi-region deployments
      • Streaming-based solution
      • 7000+ clients
      • Built-in: fail-over, load balancing
      • Transparent to our engineers; zero configuration
      • Netflix Hive warehouse: → One central data warehouse → Hourly/daily reports → Data retention/expiration
  • 8. Traceability & Performance analysis
      • Track service-level calls
          - Instrument the low-level HTTP client
          - Call graph
          - Request processing vs. perceived latency
          - Payload marshalling/unmarshalling: duration, size, etc.
          - Service result: status, error code, exception, etc.
  • 9. Diagnostic Information
      • Collect latency information for all external operations
      • If latency > threshold, log to Honu:
          - AWS Region & Zone
          - Instance
          - Service details
      • Open Jira/Ticket & attach diagnostic info
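The threshold rule on slide 9 can be sketched as follows. This is a minimal illustration under stated assumptions: the 500 ms threshold, the `timed_call` helper, the context field names, and the plain list standing in for the Honu log sink are all hypothetical, since the slides give neither a threshold value nor an API for this step.

```python
import time

# Assumed value for illustration; the slides do not state the threshold.
LATENCY_THRESHOLD_MS = 500.0


def timed_call(operation, context, sink,
               threshold_ms=LATENCY_THRESHOLD_MS, clock=time.monotonic):
    """Run operation() and record diagnostics only when it is slow.

    `context` carries the fields named on the slide (region, zone,
    instance, service details). `sink` is any object with append();
    here it stands in for the real log destination.
    """
    start = clock()
    try:
        return operation()
    finally:
        elapsed_ms = (clock() - start) * 1000.0
        if elapsed_ms > threshold_ms:
            sink.append({"latency_ms": elapsed_ms, **context})


# Hypothetical context for one external call.
example_context = {
    "region": "us-east-1",
    "zone": "us-east-1a",
    "instance": "i-example",
    "service": "example-service",
}
```

Passing the clock as a parameter keeps the latency check testable without real delays; a ticketing step like the one on the slide would consume the records collected in `sink`.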
  • 10. Mix Offline and Online Data
      Offline data:
          - Fire & forget
          - Scale to very large volumes
          - Cost effective
      Specific conditions:
          - Online data availability is not mandatory
          - If it exists, the data could be useful online
          - Only a subset is useful online
          - Ready to pay a little bit more
      Special collectors:
          - All data goes to Hive
          - A subset goes to a real-time system
          - Still cost effective
      Customer support:
          - Browsing history
          - Historical & non-critical actions
      Debug:
          - Push validation
          - Root cause analysis
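The "special collectors" idea on slide 10 (all data goes to Hive, a subset goes to a real-time system) can be sketched as a simple routing step. The `route_event` function, the event `type` field, and the set of real-time event types are assumptions made for illustration; the slides do not say how the subset is selected.

```python
# Hypothetical event types that are also useful online.
REALTIME_TYPES = {"playback_error", "customer_activity"}


def is_realtime(event):
    """Assumed selection rule: route by event type."""
    return event.get("type") in REALTIME_TYPES


def route_event(event, offline_sink, realtime_sink, predicate=is_realtime):
    """Send every event offline and copy a subset to the real-time sink."""
    offline_sink.append(event)       # all data goes to the offline store
    if predicate(event):
        realtime_sink.append(event)  # only the subset goes online


offline, realtime = [], []
for ev in [{"type": "playback_error", "id": 1},
           {"type": "perf_stat", "id": 2}]:
    route_event(ev, offline, realtime)
```

Because the offline path never depends on the real-time path, losing the online copy does not affect the data that reaches the warehouse, which matches the slide's point that online availability is not mandatory.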
  • 11. Honu Realtime usages
      • Movie playback experience
          - Video quality
          - Network issue
      • Customer support
          - Historical usage
          - Last activity
      • Errors summary
          - Error tracking per service
          - Error tracking per device
      • Launch reports
          - Push validation
          - Root cause analysis
  • 12. Honu Realtime - Architecture
      [Diagram labels: Realtime data collection pipeline; Application, Collectors, Realtime Access, Realtime System, M/R]
  • 13. A/B Testing
      Test: an experiment where several competing behaviors are implemented and compared.
      Cell: different experiences within a test that are being compared against each other.
      Allocation: a customer-specific assignment to a cell within a test.
      Online data:
          - Cell allocation: > 1 billion records
          - Test config: 1 entry/test/customer
      Tracking information (example):
          - 1 M customers per test
          - 8 tracking events per day
          - 100 tests = 800 M events/day
          - 3 months = 72 B events
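The slide 13 volume estimate can be checked with a few lines of Python. Two readings are assumed here: "8 tracking events per day" is taken per customer per test, and "3 months" is taken as 90 days, which is the value that reproduces the slide's total.

```python
customers_per_test = 1_000_000
events_per_customer_per_day = 8   # read as per customer, per test
tests = 100
days = 90                         # "3 months" taken as 90 days (assumption)

events_per_day = tests * customers_per_test * events_per_customer_per_day
events_three_months = events_per_day * days

print(f"{events_per_day:,} events/day")
print(f"{events_three_months:,} events in {days} days")
```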
  • 15. A/B Testing - Architecture
      Online data:
          - Customer test allocation
          - Metadata about the test. Ex:
              - Start/End date
              - UI directives
              - Logging directives
      Offline data:
          - Test tracking. Ex:
              - Retention
              - Engagement metrics
  • 16. Beacon Server
      User behavior:
          - Client-side interactions
          - Search/Play/Stop/Pause Ajax calls
      Device monitoring:
          - Heartbeat
          - Status & key metrics
      [Diagram labels: three Beacon servers]
  • 17. BI Integration
      Three main technologies:
      • Teradata (data center)
      • Hive (cloud)
      • Cassandra (cloud)
  • 18. Hive ↔ BI
          - Dimension tables (daily export from Teradata)
          - Hourly/daily Hive summary queries
          - Hourly/daily export from Hive to BI
      • Queries run in the cloud
      • Aggregated results go back to our BI solution
  • 20. Cassandra → BI
      • Use Cassandra backups to run analytics
      • Export SSTable to Hadoop
      • Pig to:
          - Parse SSTable
          - Extract/group required information
      • Load the result back to Teradata