
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

LinkedIn describes their journey in transitioning from a traditional warehouse to a self-serve reporting platform on Hadoop.



  1. Self-Serve Reporting Platform on Hadoop. Shirshanka Das, Strata Singapore 2015
  2. Reporting pipelines: Ingest, Process, Serve, Visualize
  3. Reporting at LinkedIn: evolution. [Architecture diagram: sources (Oracle, Espresso, Kafka, external) feeding custom pipelines, scripts, and Hadoop jobs; serving via Voldemort, Pinot, and MySQL; visualization evolving from Informatica + MSTR on Oracle/Teradata to Tableau and internal tools]
  4. Infra scale:
     - Number of Hadoop clusters: 12
     - Total number of machines: ~7k
     - Largest cluster: ~3k machines
     - Data volume generated per day: XX terabytes
     - Total accumulated data: XX petabytes
  5. People scale:
     - Reporting platform team: ~10
     - Core warehouse team: 1x
     - Data scientists: 10x
     - Business analysts: 10x
     - Product managers: 10x
     - Sales and marketing: 100x
  6. Challenges across ingest, process, serve, and visualize: disjointed efforts and unreliable systems; unpredictable SLAs across all systems; fragmented data pipelines with inconsistent data
  7. [Pipeline diagram: Ingest, Process, Serve, Visualize]
  8. Houston, we have a problem
  9. Step 1: a central transport pipeline
  10. Still have a problem
  11. Step 2: a central ingestion framework
  12. The ingestion framework handles diverse sources, stream + batch, over REST, SFTP, and JDBC, with built-in data quality checks. Open source; in production at LinkedIn, Intel, Swisscom, and NerdWallet. At LinkedIn: ~20 distinct source types, hundreds of TB per day, hundreds of datasets.
  13. Process: the Unified Metrics Platform (UMP)
  14. UMP requirements: single source of truth, easy onboarding, operability
  15. Workflow: the metric owner (1) iterates on a metric definition in a sandbox, (2) creates it, (3) the central team and relevant stakeholders review it, and (4) it is checked in to the code repository, where system jobs build the core metrics job.
  16. A metric definition captures: name, description, tags, owners; input datasets; dimensions and entity dimensions; time and temporality; the script; metrics (formulas, tier); and entity IDs.
  17. An example, video play analysis:
         name: "video"
         description: "Metrics for video tracking"
         label: "video"
         tags: [flagship, feed]
         owners: [jdoe, jsmith]
         enabled: true
         retention: 90d
         timestamp: timestamp
         frequency: daily
         script: video_play.pig
         output_window: 1d
         dimensions: [
           { name: platform, doc: "phone, tablet or desktop" },
           { name: action_type, doc: "click play or auto-play" }
         ]
         input_datasets: [
           { name: actionsRaw, path: Tracking.ActionEvent, range: 1d }
         ]
  18. An example, continued:
         metrics: [
           { name: unique_viewers, doc: "Count of unique viewers", formula: "unique(member_id)", tier: 2, good_direction: "up" },
           { name: play_actions, doc: "Sum of play actions", formula: "sum(play_actions)", tier: 2, good_direction: "up" }
         ]
         entity_ids: [
           { name: member_id, category: member },
           { name: video_id, category: video }
         ]
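To make the declarative definition concrete, here is a minimal sketch of what a daily UMP job would compute for the two declared formulas. The sample rows and the inline computation are illustrative assumptions, not UMP code; only the field and metric names come from the example above.

```python
# Hypothetical day of video play events; field names follow the example
# metric definition (member_id, video_id, play_actions).
rows = [
    {"member_id": 1, "video_id": "a", "play_actions": 2},
    {"member_id": 1, "video_id": "b", "play_actions": 1},
    {"member_id": 2, "video_id": "a", "play_actions": 4},
]

# formula: "unique(member_id)" -> count of distinct viewers
unique_viewers = len({r["member_id"] for r in rows})

# formula: "sum(play_actions)" -> total play actions
play_actions = sum(r["play_actions"] for r in rows)

print(unique_viewers, play_actions)  # 2 7
```

In the real pipeline these formulas run inside the owner's script (here, `video_play.pig`) over the declared input dataset and output window.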
  19. UMP data flow: primary data (tracking, databases, external) flows through data prep, the metric script, aggregation, cubing by dimension, and verification, producing UMP raw and aggregated data on HDFS + Pinot; consumers include dashboards, relevance, experiment analysis, and ad-hoc metrics; a UMP monitor watches the pipeline.
  20. UMP by the numbers:
     - First version in production since early 2014; significant redesign in 2015
     - Total data scanned per day: hundreds of TB
     - Metrics computed: 2k+
     - Scripts: ~400
     - Metric authors: ~200
     - Maximum dimensions per dataset: ~30
     - People responsible for pipeline upkeep: 2
  21. Learnings so far:
     - Ease of onboarding: hard with more than 1,000 users of differing skill sets; great UX is needed to complement developer-friendly alternatives
     - Single source of truth: not just a technology challenge; the organization needs to rally around it
     - Operability: a multi-tenant Hadoop pipeline with SLAs and QoS is hard; cost to serve matters; managing the metrics lifecycle is important
     - The next big things: bridging streaming and batch, code-free metrics, sessions/funnels/cohorts, open source
  22. Serve: Pinot
  23. Pinot capabilities: SQL-like interface (minus joins); sub-second query latency; data load from Hadoop and Kafka
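A sketch of the kind of query this enables, reusing the video metric from the UMP example. The query text is written in a Pinot-style SQL dialect for illustration; the table name, sample rows, and the in-memory computation below are assumptions standing in for what Pinot would answer from its columnar segments.

```python
# Illustrative Pinot-style query (no joins, single table):
query = """
SELECT action_type, SUM(play_actions)
FROM video
WHERE platform = 'phone'
GROUP BY action_type
"""

# Hypothetical rows; Pinot would serve this from pre-built segments,
# but the equivalent computation over raw rows is:
rows = [
    {"platform": "phone",   "action_type": "click_play", "play_actions": 3},
    {"platform": "phone",   "action_type": "auto_play",  "play_actions": 5},
    {"platform": "desktop", "action_type": "click_play", "play_actions": 7},
]

result = {}
for r in rows:
    if r["platform"] == "phone":  # WHERE clause
        # GROUP BY action_type, SUM(play_actions)
        result[r["action_type"]] = result.get(r["action_type"], 0) + r["play_actions"]

print(result)  # {'click_play': 3, 'auto_play': 5}
```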
  24. Pinot data flow: real-time events stream from Kafka through Samza and become queryable in minutes; batch data is processed on Hadoop and loaded on an hour-plus cadence.
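One way to read this slide is that a query spanning both paths must be split at a time boundary: older ranges are answered from batch-built data, the freshest window from the streaming path. The routing rule below is a simplified sketch of that idea, not Pinot's actual API.

```python
# Assumed: batch (Hadoop) data covers everything up to this timestamp;
# the streaming (Kafka/Samza) path covers everything after it.
offline_max_ts = 1000

def route(ts):
    """Pick which side of the hybrid store answers a given timestamp."""
    return "offline" if ts <= offline_max_ts else "realtime"

print([route(t) for t in (900, 1000, 1001)])
# ['offline', 'offline', 'realtime']
```

As each hourly batch load lands, the boundary advances and the streaming side only has to retain a short recent window.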
  25. Pinot @ LinkedIn: powers site-facing apps, reporting dashboards, and monitoring; in production since 2012; open source.
  26. Visualize: Raptor
  27. Standardize visualization:
     - Leverage: a standalone app with support for embedding; can use an existing analytics backend (Pinot)
     - Strategic: reduces dependency on 3rd-party BI tools; closer integration with LinkedIn's ecosystem of experimentation and anomaly-detection solutions
  28. Requirements: support the apps ecosystem; core visualization capabilities; metadata integration
  29. Raptor 1.0: first version built by 3 engineers in a quarter.
     - Features: integration with UMP and Pinot; time series, bar charts, …; create, publish, clone, and discover dashboards
     - Numbers: ~100 dashboards; ~400 weekly unique users
  30. The future for Raptor: social collaboration features; intelligence (anomaly detection, "Dashboards You May Like"); embedding into data products; open source
  31. A few good hammers across ingest, process, serve, and visualize: the Unified Metrics Platform, Pinot, and Raptor
  32. What we're excited about: a metadata bus alongside the Unified Metrics Platform, Pinot, and Raptor
  33. Metadata-driven end-to-end optimizations:
     - Dynamic prioritization of data ingest
     - Surfacing source data-quality issues in dashboards
     - Surfacing backfill status on dashboards
     - Cascading deprecation of dashboards, computation, and data sources through lineage
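The last point, cascading deprecation through lineage, reduces to a graph traversal: retiring a data source marks every downstream computation and dashboard reachable from it. The node names and edges below are hypothetical, not LinkedIn's actual metadata model; the sketch only shows the traversal.

```python
# Hypothetical lineage graph: node -> downstream dependents.
lineage = {
    "tracking.ActionEvent": ["ump.video_metrics"],
    "ump.video_metrics": ["raptor.video_dashboard"],
    "raptor.video_dashboard": [],
}

def deprecate(node, graph):
    """Return every node to deprecate when `node` is retired
    (iterative DFS over the dependents graph)."""
    out, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in out:
            out.add(n)
            stack.extend(graph.get(n, []))
    return out

print(sorted(deprecate("tracking.ActionEvent", lineage)))
# ['raptor.video_dashboard', 'tracking.ActionEvent', 'ump.video_metrics']
```

The same traversal run in reverse (dependents to sources) supports the other bullets, e.g. surfacing an upstream data-quality issue on every affected dashboard.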
  34. Shirshanka Das (@shirshanka). Catch me offline to chat about what we're doing for views on Hadoop, data quality, and metadata.