Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Zeotap: Data Modeling in Druid for Non temporal and Nested Data

Druid has been the production workhorse for the past 2+ years at Zeotap powering the core Audience planning across our Connect and Targeting products. Though Druid is best suited for data having time as a dimension as it partitions data based on time first, we have used Druid to serve ML powered enhanced insights and Estimation of potential dataset sizes, to assist us with our core business case of Audience planning. These are datasets without timestamp a.k.a non-temporal with high scale and having nested dimensions. These have been achieved using nuanced data modelling to store the data sets and achieve millisecond latency retrieval on top of the same. The core of the presentation would be on the data modelling journey to achieve these use cases detailing the query access patterns. We also delve upon the architecture - ingestion into druid sink and processing including ML. In the end we go over the production setup and configurations and provide the performance tunings applied. The presentation would have the following heads:

The presentation would have the following heads
* Business case in Ad-Tech and Mar-Tech vertical
* Audience Planner Usecase 1 - Insights
-Lambda Architecture and data flow
-Deep dive on data model
-Takeaways
*Audience Planner Usecase 2 - Estimator
-Architecture and data flow
-Stratified sampling explained
-Data model to solve nested data - deep dive
-Takeaways
*Audience Planner Usecase 3 - Skew correction
-Skew correction model
-Query Access
-Data model in Druid to accommodate output from ML models
-Takeaways
*Production setup, config and Tunings
*Production Operation experience takeaways

  • Be the first to comment

Zeotap: Data Modeling in Druid for Non temporal and Nested Data

  1. 1. Data Modelling in Druid for Non-temporal and Nested data Sathish KS and Chaitanya Bendre, Zeotap 1
  2. 2. Data Modelling For Non-Temporal & Nested Data Fuel For Growth.
  3. 3. 3 INTRODUCTION PRESENTER K S Sathish, VP Engineering Sathish heads the engineering at Zeotap. His responsibilities also include engineering strategy and technical architecture. He comes with 18+ years of experience and has been building big data stacks for various verticals for past 8 years. https://www.linkedin.com/in/sathishks/ Chaitanya Bendre, Lead Data Engineer Chaitanya is a Lead Data Engineer at Zeotap. He is a member of the data engineering team where he leads the Insights product suite and ML engineering. He is responsible for the architecture, design and delivery of the product and ML/Data science pipelines. He has spent the last 5 years in building big data systems. https://www.linkedin.com/in/chaitanya-bendre-37585316/
  4. 4. 4 INTRODUCTION ZEOTAP ■ Customer Intelligence Platform – SAAS + DAAS offering ■ Enables Brands to better understand their customers ○ 360 View ○ Identity resolution ○ Activation ■ Data Assets – People centric ○ Identity Data ○ Profile Data ○ Deterministic ○ Probabilistic ■ Full Privacy/GDPR compliant ■ 150+ partners – Southbound and Northbound including Telcos ■ Catering to Ad-Tech and MarTech
  5. 5. 5 INTRODUCTION ADTECH & MARTECH BUSINESS CASE ■ Understanding User scale for Reach purpose ■ Qualitative understanding of Audience composition ■ Data collected would be skewed - bias of their data collection capabilities ■ Near real-time corrected to their population for representative insights ■ Marketing ROI
  6. 6. 6 INTRODUCTION AUDIENCE PLANNER
  7. 7. Use case 1 - Audience Insights
  8. 8. 8 INSIGHTS Use case ■ Need insights on client’s audience data ■ Audience consists of user identifiers + profile at particular point in time ■ Snapshot data (Non-temporal data) ○ Ingestion timestamp as a proxy for timestamp field ■ Interactive OLAP analytics on audience data ○ Flat columns - Ex gender, age ○ Nested columns - Ex apps
  9. 9. 9 INSIGHTS Insights Interactive UI
  10. 10. 10 INSIGHTS DATA ACCESS PATTERNS ■ Top N style queries ○ GroupBy + Limit ○ Multiple columns present in GroupBy clause ■ Only query latest data ■ Queries on nested columns like interests, apps ■ Count(distinct) rollup required ○ Can use HyperLogLog or ThetaSketch in metricsSpec
  11. 11. 11 ■ Option 1 - Create new dataset for each audience ■ Option 2 - Assign unique time interval for each audience and ingest newer data with same time interval. ○ Druid will create new version of the segment with latest data if ingested with same time interval ○ Need 1:1 mapping between audience_id and segment time interval ■ Option 3 - Use ingestion timestamp as timestamp column and metadata to store version mapping ○ Ingestion timestamp considered as version ○ Use metadata to fetch latest version available and query INSIGHTS DATA MODEL
  12. 12. 12 INSIGHTS DATA MODEL: Pros and Cons Data Model choice Pros Cons Option 1 - Create new dataset for each audience ● Each audience data can be maintained independently ● Easy to keep latest data for each audience ● Queries will be faster ● Consumers of datas needs to know which dataset to query ● Very large number of small datasources. Option 2 - Assign unique segment time interval for each audience ● Each audience has its own partition within a shared datasource ● Keeping latest data for each audience is easy ● Have to map audience_id to unique time interval through application logic ● Difficult to ensure no collision Option 3 - Use proxy timestamp version column and store latest version in metadata ● Single datasource ● partitioned by audience_id for better query performance ● Batch indexing of many audiences within a single job ● Increased datasource size on disk ● Need to maintain latest version / timestamp for each audience externally
  13. 13. 13 INSIGHTS ARCHITECTURE
  14. 14. Use case 2 - Audience Estimator
  15. 15. 15 ESTIMATION ESTIMATION USE-CASE & CHALLENGES CHALLENGES ■ Dimensions 600+ ■ Number of entities US ~ 8 Billion ■ Dimension can have nesting ■ Mixed cardinality – Gender, Apps ■ UI implies sub-second latency ■ Store used for data extraction latency 10+s ■ Precomputing would be super expensive ■ Storing all data would require big Druid cluster USE-CASE ■ Dynamic Audience Segment Size Estimation ■ Count of Audience ○ Profile Filter ○ ID Filter ○ ID groups ■ Interactive in UI ■ Update in sync with ID and Profile Store changes ■ Agreed on approximate count
  16. 16. 16 ESTIMATION Sample data model zuId Country Gender Age Apps Usage # Category Bundle Id Recency z1 GBR Female 20 1 Communication com.whatsapp 10 2 Entertainment com.play.games 15 3 Communication com.whatsapp 120 z2 ITA Male 35 1 Productivity com.calculator 179 z3 ESP Male 22 1 Tools com.kuaiya.play 20 2 Productivity com.flashlight 150 z4 GBR Female 28 1 Game Adventure com.classic.bounce 130 z5 FRA Male NULL
  17. 17. 17 ESTIMATOR SAMPLING ■ Original data is sampled by multiple sampling strategies ○ Uniform sampling - 1-5 percent sampled data ○ Stratified sampling - sampling done based on weights and proportions of different attributes with each other ■ Metrics ○ HyperUnique (HyperLogLog) rollup ○ Count of rows ■ Dimensions ○ All profile attributes including nested data column like AppsUsage ○ Nested columns are flattened with flattenSpec.
  18. 18. 18 ESTIMATOR ARCHITECTURE & DATA FLOW
  19. 19. 19 ESTIMATION ORIGINAL DATA – MULTI VALUE DIMENSIONS zuId Apps Usage # Category Bundle Id Recency z1 1 Social com.facebook 10 2 Entertainment com.play.games 15 3 Communication com.whatsapp 120 ■ Constraints ○ Array of struct => Association between category and bundleId ○ Druid only supports "multi-value" string dimensions. For ex, [“Social”, “Entertainment”, “Communication”]
  20. 20. 20 ESTIMATION DATA MODEL – CHALLENGES ■ Only Array of strings support as complex type other than trivial type ■ Queries on Nested data ■ Samples ○ Distinct count of users that are using apps belonging to category Social OR Finance ○ Distinct count of users that are using apps belonging to category Social AND Finance ○ Distinct count of users that are using app with bundleId com.facebook that has category as Social
  21. 21. 21 ESTIMATION DATA MODEL – CHECK1 ■ Count of users with category Social OR Finance ○ AppCategory = ‘Social’ OR AppCategory = ‘Finance’ ■ Count of users with category Social AND Finance - Not possible zuId AppCategory AppBundleId z1 Social com.facebook z1 Finance com.finance z1 Communication com.whatsapp
  22. 22. 22 ESTIMATION DATA MODEL – CHECK2 zuId AppCategoryList AppBundleIdList z1 [Social, Finance, Communication] [com.facebook, com.finance, com.whatsapp] ■ Count of users that are using apps belonging to category Social OR Finance ○ AppCategoryList in (‘Social’, ‘Finance’) ■ Count of users that are using apps belonging to category Social AND Finance ○ AppCategoryList in (‘Social’) AND AppCategoryList in (‘Finance’) ■ Count of users that are using app with bundleId com.facebook that has category as Social - Not possible
  23. 23. 23 ESTIMATION DATA MODEL – CHECK3 and FINAL zuId AppCategory AppBundleId AppCategoryList AppBundleIdList z1 Social com.facebook [Social, Finance, Communication] [com.facebook, com.finance, com.whatsapp] z1 Finance com.play.games [Social, Finance, Communication] [com.facebook, com.finance, com.whatsapp] z1 Communication com.whatsapp [Social, Finance, Communication] [com.facebook, com.finance, com.whatsapp] ■ Count of users that are using apps belonging to category Social OR Finance ○ AppCategory in (‘Social’, ‘Finance’) ■ Count of users that are using apps belonging to category Social AND Finance ○ AppCategoryList in (‘Social’) AND AppCategoryList in (‘Finance’) ■ Count of users that are using app with bundleId com.facebook that has category as Social ○ AppCategory = ‘Social’ and AppBundleId = ‘com.facebook’ ■ Count of users that are using (an app with bundleId com.facebook that has category as Social) AND (an app with bundleId com.whatsapp that has category as Communication) ○ ((AppCategory = ‘Social’ and AppBundleId = ‘com.facebook’) AND ( AppCategoryList in (‘Social’, ‘Finance’) AND AppBundleIdList in (‘com.facebook’, ‘com.whatsapp’)) OR ((AppCategory = ‘Finance’ and AppBundleId = ‘com.whatsapp’’) AND ( AppCategoryList in (‘Social’, ‘Finance’) AND AppBundleIdList in (‘com.facebook’, ‘com.whatsapp’))
  24. 24. Use case 3 - Audience skew correction
  25. 25. 25 SKEW CORRECTION CHALLENGES ■ Data is highly skewed in demographic distribution ○ Gender skew ○ Age skew ■ Need ratio correction wrt to census data ■ Interactive queries with rollup of corrected ratio ■ Augmenting original schema with corrected ratio
  26. 26. 26 SKEW CORRECTION DATA AND QUERY PATTERNS ■ F/M ratio is calculated for each location offline ■ When querying, need to know rolled up f/m ratio for a given clause. For ex, f/m ratio for customer_type = Consumer ■ Average is not available directly in metricSpec for preAggregation / rollup ■ But we can store doubleSum(ratio_city) and HLL for count(distinct) as two separate metrics ■ At runtime, we query sum(ratio_city) / count(distinct users) for getting aggregated ratio user_id gender customer_type location_city f/m ratio_city 1 M Enterprise BLR 0.8 2 M Consumer MUM 1.1 3 F Consumer BLR 0.8 4 M Consumer MUM 1.1
  27. 27. 27 Production configs and scale ■ At Zeotap, we use open source Druid. Currently version 0.18.0 ■ Deployed in GCP ○ Dataproc cluster for Hadoop batch indexing ○ GCS as deep storage ○ Cloud SQL for metadata ■ Cluster size ○ 3 regions. Per region - ■ 1 coordinator + overlord node ■ 6 historical + middleManager nodes - (with 3 month retention period) ■ 2 broker nodes ■ Monitoring via GraphiteEmitter + Custom Grafana dashboards. Have setup alerts on top of Grafana ■ All queries through Druid SQL. In some places, use native queries for better performance
  28. 28. 28 Challenges ■ Monitoring not out of box for Druid open source. We setup monitoring by using GraphiteEmitter and Grafana dashboards ■ Migration from AWS to GCP ■ Not wise to launch all jobs as Hadoop Indexing on YARN cluster. Choice depends on size of input data ○ Smart indexing ■ Data duplication for non-temporal data which does not have natural time dimension ■ Long running queries. Some queries like GroupBy are long running and block entire resources for historical and broker. ○ Query laning released recently
  29. 29. THANK YOU Fuel For Growth.
  30. 30. Time for questions @zeotap 30 Thank you! Apache Druid is an independent project of The Apache Software Foundation. More information can be found at https://druid.apache.org. Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
  31. 31. Dates: TBA druidsummit.org Register Now for the Next Druid Virtual Summit

×