Distilling insights @ AppsFlyer

On building machine learning models using Spark @ AppsFlyer

The presentation includes a short intro to AppsFlyer (architecture, data architecture) and shows the process of building a model through a use case: a fingerprinting model for matching clicks to installs.

Published in: Technology

  1. Distilling insights @ AppsFlyer. Arnon Rotem-Gal-Oz, Chief Data Officer
  2. Data’s hierarchy of needs (with apologies to Maslow)
  3. What is AppsFlyer? Mobile attribution measurement and analytics
  4. Let’s drill down
  5. Architecture diagram (labels only): Kafka; Secor; Raw (sequence files); DW (parquet files); DM (Aggregations); Aggregations; Columnar Database (Redshift, evaluating Vertica); SparkSQL (evaluating Drill, Presto); SQL; Spark; Spark ETL; Spark ML; Scoring; Latest Events; Vishnu; Blueshift; Mojito; Self-serve BI (TBD); Accounts; Application dashboard; Latest event exploration; installs, clicks, in-app, launches
  6. Understand the problem
  7. Mobile Advertising
  8. Mobile attribution
  9. Fingerprinting
  10. Get the data from the big data lake
  11. Or locate it somehow in the big data swamp…
  12. Exploration
  13. SparkSQL is a nice tool to find relevant data

      Query:

      """
        |select
        |m.raw_device_params.brand,
        |m.raw_device_params.model,
        |m.raw_device_params.lang,
        |m.raw_device_params.carrier,
        |m.raw_device_params.network,
        |m.raw_device_params.currency,
        |m.geo_info.city as m_city,
        |m.geo_info.country_code as m_country,
        |m.geo_info.region as m_region,
        |m.device.os_version as m_os,
        |m.device.language as m_lang,
        |m.timestamp as m_timestamp,
        |m.ip as m_ip,
        |c.ip as c_ip,
        |c.event_time as c_timestamp,
        |c.original_url as c_url,
        |c.user_agent as c_ua,
        |c.client_cookie as c_cookie,
        |c.country as c_country,
        |c.region as c_region,
        |c.city as c_city,
        |c.language as c_lang,
        |1 as is_match
        |from c join m on (m.app_id = c.app_id
        |  and m.attribution.transaction_id = c.transaction_id
        |  and m.attribution.`match-type` = 'ref')
      """.stripMargin

      Schema (truncated on the slide):

      root
       |-- action_context: string (nullable = true)
       |-- action_name: string (nullable = true)
       |-- action_type: string (nullable = true)
       |-- alg-check-timestamp: struct (nullable = true)
       |    |-- in: string (nullable = true)
       |    |-- out: string (nullable = true)
       |-- api_version: string (nullable = true)
       |-- app_id: string (nullable = true)
       |-- app_name: string (nullable = true)
       |-- attribution: struct (nullable = true)
       |    |-- action_context: string (nullable = true)
       |    |-- action_name: string (nullable = true)
       |    |-- action_type: string (nullable = true)
       |    |-- additional_data: struct (nullable = true)
       |    |    |-- C50%New: string (nullable = true)
       |    |    |-- Idfa: string (nullable = true)
       |    |    |-- LineItemId: string (nullable = true)
       |    |    |-- MID: string (nullable = true)
       |    |    |-- PRD: string (nullable = true)
       |    |    |-- PublisherId: string (nullable = true)
       |    |    |-- SD: string (nullable = true)
       |    |    |-- SSO: string (nullable = true)
       |    |    |-- SetUid: string (nullable = true)
       |    |    |-- Source: string (nullable = true)
       |    |    |-- SourceId: string (nullable = true)
       |    |    |-- UID: string (nullable = true)
       |    |    |-- UUID: string (nullable = true)
       |    |    |-- W-all-25-60-audiobook: string (nullable = true)
       |    |    |-- _: string (nullable = true)
       |    |    |-- a76852453141ced: string (nullable = true)
       |    |    |-- actionid: string (nullable = true)
       |    |    |-- ad_sub1: string (nullable = true)
       |    |    |-- adid: string (nullable = true)
       |    |    |-- advertising_id: string (nullable = true)
       |    |    |-- advertising_id : string (nullable = true)
       |    |    |-- adxclkid: string (nullable = true)
       |    |    |-- af: string (nullable = true)
       |    |    |-- af_cid: string (nullable = true)
       |    |    |-- af_cpi: string (nullable = true)
       |    |    |-- af_dp: string (nullable = true)
       |    |    |-- af_google_channel: string (nullable = true)
       |    |    |-- af_id: string (nullable = true)
       |    |    |-- af_installpostback: string (nullable = true)
       |    |    |-- af_prt: string (nullable = true)
  14. (no text)
  15. Feature selection
  16. UDFs to generate features
  17. What’s the distance between two IP addresses?
  18. Big data doesn’t always mean we need to analyze petabytes of data; sometimes it means we can find just the right sample
  19. Model selection
      • Naive Bayes (built in)
      • Logistic Regression (built in)
      • SVM (built in)
      • Decision trees (built in)
      • Locality sensitive hashing (https://github.com/mrsqueeze/spark-hash)
  20. Transform from DataFrames to MLlib: LabeledPoint, Vectors.dense, Row, Schema, categoricalFeature
  21. Model evaluation
  22. Torture the data enough and it will confess to anything
  23. Takeaways
      • Big data is not just about big data
      • Getting insights: it’s a process
      • Spark is great but can drive you crazy :)
  24. Summary
      • Understand the problem
      • Data exploration
      • Feature selection (and building)
      • (ETLing)
      • Model selection
      • Model evaluation
  25. We’re hiring… jobs@appsflyer.com
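
Slides 16 and 17 ask how far apart two IP addresses are, as a feature for matching a click to an install. The deck does not say which distance was used, so what follows is only a minimal sketch under stated assumptions: IPv4 only, and "distance" taken to be 32 minus the number of leading bits the two addresses share. The names `IpDistance`, `ipToLong`, and `prefixDistance` are hypothetical. In Spark, a function like this could be registered as a UDF and applied to the `m_ip` and `c_ip` columns from the slide 13 query.

```scala
object IpDistance {
  // Parse a dotted-quad IPv4 string into its 32-bit value (held in a Long).
  // Returns None unless the input has exactly four octets, each in 0..255.
  def ipToLong(ip: String): Option[Long] = {
    val parts = ip.trim.split('.').toList
    val octets = parts
      .flatMap(p => scala.util.Try(p.toInt).toOption)
      .filter(v => v >= 0 && v <= 255)
    if (parts.length == 4 && octets.length == 4)
      Some(octets.foldLeft(0L)((acc, o) => (acc << 8) | o))
    else None
  }

  // Assumed distance: 32 minus the length of the shared leading-bit prefix.
  // 0 means identical addresses; 32 means they differ in the very first bit.
  def prefixDistance(a: String, b: String): Option[Int] =
    for {
      x <- ipToLong(a)
      y <- ipToLong(b)
    } yield {
      val diff = x ^ y
      if (diff == 0L) 0 else 64 - java.lang.Long.numberOfLeadingZeros(diff)
    }
}
```

Returning `Option` keeps malformed addresses out of the feature rather than mapping them to an arbitrary number; a prefix-based distance is one choice among several and may not be the one used in the talk.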

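Slide 20 names the pieces involved in moving from DataFrame rows to MLlib input: a label, a dense feature vector, and categorical features that must become numbers first. The sketch below shows that shape in plain Scala, without depending on Spark. `LabeledExample` is a hypothetical stand-in for MLlib's `LabeledPoint`, the helpers `indexCategories` and `toLabeled` are also hypothetical, and the choice of features is an assumption for illustration, not the model from the talk.

```scala
object ToLabeled {
  // Stand-in for MLlib's LabeledPoint: a label plus a dense feature array.
  final case class LabeledExample(label: Double, features: Array[Double])

  // Hypothetical joined click/install record; fields mirror aliases
  // from the slide 13 query (m_country, c_country, m_lang, c_lang, is_match).
  final case class Joined(
      mCountry: String,
      cCountry: String,
      mLang: String,
      cLang: String,
      isMatch: Int
  )

  // Map each distinct category to a numeric index (sorted so the
  // mapping is deterministic across runs).
  def indexCategories(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  // Build one labeled example: two equality flags plus the indexed
  // install country. Unseen categories are encoded as -1.
  def toLabeled(r: Joined, countryIndex: Map[String, Int]): LabeledExample =
    LabeledExample(
      label = r.isMatch.toDouble,
      features = Array(
        if (r.mCountry == r.cCountry) 1.0 else 0.0,
        if (r.mLang == r.cLang) 1.0 else 0.0,
        countryIndex.getOrElse(r.mCountry, -1).toDouble
      )
    )
}
```

With Spark on the classpath, the same mapping would typically run inside a `map` over the DataFrame's rows, producing `LabeledPoint(label, Vectors.dense(features))` instead of the stand-in case class.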