
Game Analytics at London Apache Druid Meetup

Presentation given by Ramón Lastres Guerrero of Game Analytics.



  1. Apache Druid @ London Apache Druid Meetup, 15th of January 2020
  2. Agenda
     ➢ Introduction to GameAnalytics
     ➢ Backend overview
     ➢ Druid-based solution
        ○ Schema design
        ○ Ingestion
        ○ Performance
        ○ Monitoring
     ➢ Planned next steps
     ➢ Summary / takeaways
     ➢ Questions
  3. Introduction to GameAnalytics
     We provide user behaviour analytics for video game developers, similar to services like Google Analytics, Firebase, Facebook Analytics and so on. In contrast to those services, we focus on gaming only.
     We provide SDKs for the most popular game development tools, as well as a REST API: https://gameanalytics.com/docs/item/rest-api-doc
     The main tool game developers interact with is our web application, where they can see results in real time as well as historical aggregates.
  4. Introduction to GameAnalytics: how much data do we process?
     ● 25,000+ Daily Active Games
     ● 125M+ DAU (Daily Active Users)
     ● 1.2B+ MAU (Monthly Active Users)
     ● 15B+ events per day (on peak days)
     ● All of our data is in JSON format
  5. 25,000 Daily Active Games
  6. Analytics for 90,000 Game Developers
  7. Interactive Filtering
  8. Interactive Filtering
  9. Technical Requirements
     What are the high-level technical requirements for a service like GameAnalytics?
     ● Low query latency (responsive frontend)
     ● Streaming ingestion and real-time queries with relatively small delay
     ● Reliability
     ● Low infrastructure cost
     ● Flexible querying for users
     ● Most queries involve unique-user counts
  10. Backend Overview
     We can talk about three main components or services:
     ● Data collection
     ● Data annotation (enrichment)
     ● Aggregation and reporting
  11. Data Collection
     We run a web service in an auto-scaling group. It simply writes the raw JSON events to S3, with some buffering. We have several articles about this topic on our blog.
  12. Data Annotation System (architecture diagram: S3, S3, DynamoDB)
  13. Data Annotation System
     We run micro-batching, keeping state in DynamoDB with a Redis cache. We are moving to reading from and writing to Kinesis (about to deploy to production). The service annotates events to make querying in the follow-up service easier. More on this topic later.
  14. Aggregation and Reporting: Legacy System (architecture diagram)
  15. Aggregation and Reporting: Legacy System
     Implemented in Erlang, with data stored in memory (recent data) and in DynamoDB (historical). It supported streaming (micro-batching) and real-time queries, and query latency was low. Our in-house implementation of the HyperLogLog algorithm is open source: https://github.com/GameAnalytics/hyper
  16. Aggregation and Reporting: Challenges
     We had several problems:
     ● Cost: traffic was increasing and the system was not cost-efficient enough
     ● Reliability: the master-slave architecture made the system difficult to keep stable
     ● New features were difficult to implement
     ● Knowledge of the code base had been lost
     ● Filtering was only possible on a single dimension
     We needed a replacement that would let us spend more time delivering valuable features for our customers, while controlling cost and scaling easily.
  17. Aggregation and Reporting: Druid (architecture diagram: ingestion from S3)
  18. Druid: Schema Design
     Schema design is key for optimizing performance and cost. When adopting Druid, we ran several ingestion tests to examine the resulting rollups: you want the best rollup possible that still fulfills your query requirements.
     We have mostly one big datasource and one streaming ingestion. We use HyperLogLog sketches for most of our queries. We ingest with hourly granularity and later roll up to daily granularity using EMR.
     The Druid documentation is your friend: https://druid.apache.org/docs/latest/ingestion/schema-design.html
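As an illustration of the approach described on this slide, a Druid `dataSchema` with rollup enabled and an ingestion-time HLL metric might look like the following, expressed here as a Python dict. All column and metric names (`user_id`, `revenue`, and so on) are illustrative assumptions, not GameAnalytics' actual schema.

```python
# Sketch of a Druid dataSchema with rollup enabled, as a Python dict.
# Dimension and metric names are illustrative assumptions.
data_schema = {
    "dataSource": "events",
    "timestampSpec": {"column": "ts", "format": "iso"},
    "dimensionsSpec": {
        "dimensions": ["game_id", "platform", "event_type", "country"]
    },
    "metricsSpec": [
        {"type": "count", "name": "event_count"},
        {"type": "doubleSum", "name": "revenue_sum", "fieldName": "revenue"},
        # Build a HyperLogLog sketch at ingestion time so unique-user
        # queries never need to touch raw user ids.
        {"type": "hyperUnique", "name": "user_hll", "fieldName": "user_id"},
    ],
    "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        # All rows in the same hour with the same dimension values
        # collapse into one rolled-up row.
        "queryGranularity": "HOUR",
        "rollup": True,
    },
}
```

The rollup ratio you get depends directly on how many dimensions you keep and how coarse the query granularity is, which is why ingestion tests against real data are worth running.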
  19. Druid: Schema Design. HyperLogLog
     From Wikipedia: "HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset. Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets."
     Druid provides an HLL-based aggregator. We leverage this at GA, since our queries mostly report on a per-user basis:
     ● Active Users (Daily, Weekly, Monthly)
     ● Average Revenue Per Daily Active User (ARPDAU)
     ● Retention
     ● ...
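The count-distinct idea can be illustrated with a toy HyperLogLog in Python. This is a sketch of the algorithm only: it omits the small- and large-range corrections of the full paper and is unrelated to both Druid's implementation and GameAnalytics' open-source `hyper` library.

```python
import hashlib


def _hash64(value: str) -> int:
    # 64-bit hash derived from SHA-1 (an illustrative choice of hash).
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")


class ToyHyperLogLog:
    """Toy HLL: m = 2^p registers, each remembering the longest run of
    leading zero bits seen among the hashes routed to it."""

    def __init__(self, p: int = 10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        h = _hash64(value)
        idx = h >> (64 - self.p)               # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits
        # rho = position of the leftmost 1-bit in the remaining bits
        rho = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rho)

    def estimate(self) -> float:
        # Bias-corrected harmonic mean of the registers (raw estimate;
        # the range corrections from the paper are omitted here).
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
```

With p = 10 (1024 registers, i.e. about 1 KB of state) the standard error is roughly 3%, which is why a sketch this small can stand in for millions of raw user ids.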
  20. Druid: Schema Design. Metrics and Dimensions
     We currently have 53 dimensions and 10 metrics. The resulting roll-up has about 10x fewer rows than the raw data.
  21. Druid: Real-time Ingestion
     We ingest data in a streaming fashion using the Kinesis Ingestion Service (KIS). You need one Kinesis ingestion per datasource.
     We re-ingest our main datasource after 48 hours with daily granularity: Kinesis ingestion does not generate perfectly rolled-up segments, so re-ingestion or segment compaction is needed for optimal performance. There are different approaches we could take.
  22. Druid: Batch Ingestion
     You can use segment compaction to create better segments from the KIS ones; however, we must re-ingest to change granularity from hour to day. It is also possible to re-ingest from the Kinesis segments instead of ingesting from the raw data again (both with EMR and with native ingestion).
     We use EMR, but Druid native ingestion (index_parallel) is also an option, and in recent Druid versions it can guarantee perfect rollup.
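A native `index_parallel` re-ingestion task that reads the hourly segments back (via the `druid` input source) and rewrites them with daily granularity could be sketched as below. The datasource name and interval are illustrative.

```python
# Sketch of an index_parallel reindex task, as a Python dict.
# Datasource name and interval are illustrative assumptions.
reindex_task = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "events",
            "granularitySpec": {
                "segmentGranularity": "DAY",
                "queryGranularity": "DAY",  # coarser than the hourly originals
                "rollup": True,
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            # Read existing Druid segments back instead of raw S3 data.
            "inputSource": {
                "type": "druid",
                "dataSource": "events",
                "interval": "2020-01-01/2020-01-02",
            },
        },
        "tuningConfig": {
            "type": "index_parallel",
            # Hash partitioning lets recent Druid versions guarantee
            # perfect rollup for the rewritten segments.
            "partitionsSpec": {"type": "hashed"},
            "forceGuaranteedRollup": True,
        },
    },
}
```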
  23. Druid: Batch Ingestion Coordination
     We use AWS Step Functions and Lambdas to coordinate EMR ingestion for our Druid clusters.
  24. Druid: Cluster Topology
  25. Druid: Tiering
     We run two tiers: our oldest data (older than 6 months) is accessed less often, so we can serve it with less powerful hardware.
  26. Druid: Tiering
     You should leverage tiering according to your query patterns. In our case, we use it to lower costs by serving less frequently accessed data from cheaper hardware. Another use case is serving the most frequently accessed data from more powerful hardware (in-memory, AWS R-type instances); we are considering this for the future.
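Tiering of this kind is driven by coordinator load rules, evaluated top to bottom. A hedged sketch of a two-tier rule chain follows; the tier name `cold` and the replica counts are assumptions, not the deck's actual configuration.

```python
# Sketch of Druid coordinator load rules for a two-tier setup.
# Tier names and replica counts are illustrative assumptions.
load_rules = [
    {
        # Data from the last six months stays on the default (faster) tier.
        "type": "loadByPeriod",
        "period": "P6M",
        "includeFuture": True,
        "tieredReplicants": {"_default_tier": 2},
    },
    {
        # Everything not matched above (i.e. older than six months)
        # is served by a cheaper tier with fewer replicas.
        "type": "loadForever",
        "tieredReplicants": {"cold": 1},
    },
]
```

Historicals join a tier via `druid.server.tier` in their runtime properties; the coordinator then moves segments between tiers as they age past the rule boundaries.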
  27. Druid: Query Layer
     We decided to build our own query layer, which allows us to:
     ● Provide a higher-level abstraction for the frontend, and define metrics on the backend side
     ● Implement authentication that works well with our other backend systems
     ● Fine-tune things like caching, query priorities, rate limiting and so on
     ● Use a programming language that we are comfortable with
     We implemented the query layer in Elixir. There was no Druid client available for Elixir, so we created our own: it lets you build Druid queries using macros and translates them to Druid JSON. It is open source: https://github.com/GameAnalytics/panoramix
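The actual query layer is Elixir (Panoramix); the same idea, building a native Druid JSON query and posting it to a broker, can be sketched in Python. The `user_hll` column name and the broker URL are assumptions.

```python
import json
import urllib.request


def build_dau_query(datasource: str, interval: str) -> dict:
    """Native Druid timeseries query counting daily uniques via the
    hyperUnique aggregator. The user_hll column name is illustrative."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "day",
        "intervals": [interval],
        "aggregations": [
            {"type": "hyperUnique", "name": "dau", "fieldName": "user_hll"}
        ],
    }


def run_query(broker_url: str, query: dict) -> list:
    """POST the query JSON to a Druid broker, e.g. http://broker:8082."""
    req = urllib.request.Request(
        broker_url + "/druid/v2/",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

A query layer like this is also the natural place to attach authentication, priorities and caching before the request ever reaches the broker.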
  28. Druid: Query Layer Caching
     Initially we went live without a cache in the query layer. That was a huge mistake: we had assumed the caching on the Druid brokers would be enough. Lesson learned: always implement good caching in front of your database in a use case like this. After adding caching, query latency improved dramatically.
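A minimal sketch of such a cache, keyed by the canonical JSON of the query, might look like this. It is illustrative only, not the deck's actual (Elixir) implementation.

```python
import hashlib
import json
import time


class QueryCache:
    """Minimal TTL cache keyed by the canonical JSON of a Druid query."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store = {}

    @staticmethod
    def _key(query: dict) -> str:
        # sort_keys makes logically identical queries hash identically.
        canonical = json.dumps(query, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get(self, query: dict):
        entry = self._store.get(self._key(query))
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            return None  # expired; caller falls through to Druid
        return value

    def put(self, query: dict, value) -> None:
        self._store[self._key(query)] = (value, time.monotonic() + self.ttl)
```

In practice the TTL would be tuned per query type: near-real-time dashboards need short TTLs, while historical aggregates can be cached far longer.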
  29. Druid: Query Layer Caching
  30. Druid: Performance on Druid brokers
  31. Druid: Performance on Query Layer
  32. Druid: Performance
     ● We keep data for 1 year in the Druid cluster
     ● Around 40k queries per hour to the Druid cluster
     ● We process 15 billion events per day, with peaks of over 250k events per second
  33. Druid: Imply Pivot
     Imply Pivot is a potential alternative to writing your own query layer and frontend. We leverage Pivot for internal use within GA. It is also possible to use it as an interface for external customers.
  34. Druid: Pivot
  35. Druid: Monitoring and Upgrades
     We use Graphite and Grafana for application-level monitoring of the query layer. To monitor the Druid cluster we use Imply Clarity, and of course CloudWatch can also be used to monitor both. Rolling cluster upgrades are automated by Imply Cloud.
  36. Druid: Monitoring with Clarity
  37. Druid: What about the annotations?
     We talked about the annotation service before.
  38. Druid: Annotation System
     There are two sources of data: the SDK (devices) and attribution partners. The SDK provides user behaviour, while attribution tells us where the user comes from. Users want to filter the data on both user behaviour and attribution, so we need to join the two data streams.
  39. Druid: Annotation System
     From the Druid documentation: "If you need to join two large distributed tables with each other, you must do this before loading the data into Druid. Druid does not support query-time joins of two datasources."
     Our annotation service prepares data for ingestion into Druid.
  40. Druid: Annotation System
     This is a design choice, as stated in the Druid documentation: https://druid.apache.org/docs/latest/querying/joins.html
     Support for joins is on the Druid roadmap, but it is already possible to perform simple joins using lookups. Some preparation before ingestion into Druid is generally needed; however, Druid ingestion handles things such as aggregation, rollup, filtering and exactly-once processing guarantees for you.
  41. Druid: Lookups
     The lookups feature allows simple joins with data stored outside of Druid. In our case, we have the relation game_id -> studio_id -> organization_id: we only ingest game_id into Druid, and using lookups we can query at the studio and organization level, which are stored in a MySQL DB.
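A native groupBy query that reports at studio level, even though only `game_id` is stored in Druid, could use a lookup dimension spec like the one sketched below. The lookup name `game_to_studio` and the column names are illustrative assumptions.

```python
# Sketch of a groupBy query using a registered lookup to map game_id to
# studio_id at query time. Names are illustrative assumptions.
studio_query = {
    "queryType": "groupBy",
    "dataSource": "events",
    "granularity": "day",
    "intervals": ["2020-01-01/2020-01-08"],
    "dimensions": [
        {
            "type": "lookup",
            "dimension": "game_id",       # column stored in Druid
            "outputName": "studio_id",    # column seen in the result
            "name": "game_to_studio",     # registered lookup, e.g. synced from MySQL
        }
    ],
    "aggregations": [
        {"type": "hyperUnique", "name": "dau", "fieldName": "user_hll"}
    ],
}
```

The lookup itself is registered with the cluster separately (for example, periodically refreshed from the MySQL tables), so the mapping can change without re-ingesting any data.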
  42. Druid: Calculating Player Retention
     We also use the annotation service for other things, such as player retention calculation. The most common way to calculate retention in Druid would be the Theta Sketches feature, but adding those sketches to our datasource increases its size (and therefore the cost) by 30%. Instead, we annotate events with the (truncated) installation timestamp, so that we do not need the sketches to calculate retention.
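With events annotated this way, day-N retention reduces to comparing each event's date with its install date. A minimal Python sketch of that calculation follows; the tuple layout of `events` is an assumption for illustration.

```python
from datetime import date


def day_n_retention(events, cohort_date: date, n: int) -> float:
    """Day-N retention for one install cohort, computed from events that
    carry a truncated install date -- the same trick as in the slides.
    `events` is an iterable of (user_id, install_date, event_date) tuples.
    """
    # Users who installed on the cohort date.
    cohort = {u for u, inst, _ in events if inst == cohort_date}
    if not cohort:
        return 0.0
    # Of those, users who came back exactly n days after installing.
    returned = {
        u
        for u, inst, ev in events
        if inst == cohort_date and (ev - inst).days == n
    }
    return len(returned) / len(cohort)
```

In Druid terms, the same shape becomes a filter on the install-date dimension plus a unique-user aggregation per event day, with no Theta Sketch set operations required.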
  43. Druid: Other Considerations
     ● Data partitioning: it is possible to ingest data partitioned by tenant_id (game_id)
     ● Multiple datasources instead of just one: ingesting the data several times into different datasources, removing certain dimensions, might speed up query times
  44. Druid @ GameAnalytics: What next?
     ● We are building an A/B testing solution, and Druid plays an important role in it
     ● Maybe a funnels feature using Druid Theta Sketches? https://imply.io/post/clickstream-funnel-analysis-with-apache-druid
     ● We are about to enable query vectorization in production
  45. Resources
     ● GameAnalytics technical blog: https://gameanalytics.com/blog/category/game-development/engineering
     ● GameAnalytics case study (Imply blog): https://imply.io/post/why-gameanalytics-migrated-to-druid
     ● Druid joins design proposal: https://github.com/apache/druid/issues/8728
