Distilling insights @ AppsFlyer
Arnon Rotem-Gal-Oz
Chief Data Officer
Data’s hierarchy of needs*
*With apologies to Maslow
What is AppsFlyer?
Mobile Attribution Measurement and Analytics
Let’s drill down
Architecture diagram (labels as they appear on the slide): Kafka; Secor; Raw (sequence files); DW (parquet files); DM (Aggregations); Columnar Database (Redshift, evaluating Vertica); Aggregations; SparkSQL (evaluating Drill, Presto); SQL; Spark; ETL; ML; Vishnu; Self-serve BI (TBD); Latest Events; Scoring; Blueshift; Mojito; installs, clicks, in-app, launches; Accounts; Application dashboard; Latest event exploration
Understand the problem
Mobile Advertising
Mobile attribution
Fingerprinting
Get the data from the big data lake
Or locate it somehow in the big data swamp…
Exploration
SparkSQL is a nice tool to find relevant data
// Label known-good pairs: rows of c and m that share an app id and a
// transaction id, where the attribution match type is 'ref'.
"""
  |select
  |  m.raw_device_params.brand,
  |  m.raw_device_params.model,
  |  m.raw_device_params.lang,
  |  m.raw_device_params.carrier,
  |  m.raw_device_params.network,
  |  m.raw_device_params.currency,
  |  m.geo_info.city as m_city,
  |  m.geo_info.country_code as m_country,
  |  m.geo_info.region as m_region,
  |  m.device.os_version as m_os,
  |  m.device.language as m_lang,
  |  m.timestamp as m_timestamp,
  |  m.ip as m_ip,
  |  c.ip as c_ip,
  |  c.event_time as c_timestamp,
  |  c.original_url as c_url,
  |  c.user_agent as c_ua,
  |  c.client_cookie as c_cookie,
  |  c.country as c_country,
  |  c.region as c_region,
  |  c.city as c_city,
  |  c.language as c_lang,
  |  1 as is_match
  |from c join m on (m.app_id = c.app_id and
  |  m.attribution.transaction_id = c.transaction_id and
  |  m.attribution.`match-type` = 'ref')
""".stripMargin
root
 |-- action_context: string (nullable = true)
 |-- action_name: string (nullable = true)
 |-- action_type: string (nullable = true)
 |-- alg-check-timestamp: struct (nullable = true)
 |    |-- in: string (nullable = true)
 |    |-- out: string (nullable = true)
 |-- api_version: string (nullable = true)
 |-- app_id: string (nullable = true)
 |-- app_name: string (nullable = true)
 |-- attribution: struct (nullable = true)
 |    |-- action_context: string (nullable = true)
 |    |-- action_name: string (nullable = true)
 |    |-- action_type: string (nullable = true)
 |    |-- additional_data: struct (nullable = true)
 |    |    |-- C50%New: string (nullable = true)
 |    |    |-- Idfa: string (nullable = true)
 |    |    |-- LineItemId: string (nullable = true)
 |    |    |-- MID: string (nullable = true)
 |    |    |-- PRD: string (nullable = true)
 |    |    |-- PublisherId: string (nullable = true)
 |    |    |-- SD: string (nullable = true)
 |    |    |-- SSO: string (nullable = true)
 |    |    |-- SetUid: string (nullable = true)
 |    |    |-- Source: string (nullable = true)
 |    |    |-- SourceId: string (nullable = true)
 |    |    |-- UID: string (nullable = true)
 |    |    |-- UUID: string (nullable = true)
 |    |    |-- W-all-25-60-audiobook: string (nullable = true)
 |    |    |-- _: string (nullable = true)
 |    |    |-- a76852453141ced: string (nullable = true)
 |    |    |-- actionid: string (nullable = true)
 |    |    |-- ad_sub1: string (nullable = true)
 |    |    |-- adid: string (nullable = true)
 |    |    |-- advertising_id: string (nullable = true)
 |    |    |-- advertising_id : string (nullable = true)
 |    |    |-- adxclkid: string (nullable = true)
 |    |    |-- af: string (nullable = true)
 |    |    |-- af_cid: string (nullable = true)
 |    |    |-- af_cpi: string (nullable = true)
 |    |    |-- af_dp: string (nullable = true)
 |    |    |-- af_google_channel: string (nullable = true)
 |    |    |-- af_id: string (nullable = true)
 |    |    |-- af_installpostback: string (nullable = true)
 |    |    |-- af_prt: string (nullable = true)
Feature selection
UDFs to generate features
What’s the distance between two IP addresses?
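The slides do not say which distance was used, so the following is only a sketch of one plausible choice: the fraction of leading bits that two IPv4 addresses do not share. The object name `IpDistance` and its methods are hypothetical; a function like `distance` could then be wrapped as a Spark UDF to produce a feature column.

```scala
import scala.util.Try

// Hypothetical feature helper; the metric is an assumption, not the talk's method.
object IpDistance {
  /** Parse a dotted-quad IPv4 string into an unsigned 32-bit value held in a Long. */
  def toLong(ip: String): Option[Long] = {
    val parts = ip.trim.split('.')
    // Keep only the parts that parse as an octet in 0..255.
    val octets = parts.flatMap(p => Try(p.toInt).toOption).filter(o => o >= 0 && o <= 255)
    if (parts.length == 4 && octets.length == 4)
      Some(octets.foldLeft(0L)((acc, o) => (acc << 8) | o))
    else None
  }

  /** Number of leading bits the two addresses have in common (0 to 32). */
  def commonPrefixBits(a: String, b: String): Option[Int] =
    for (x <- toLong(a); y <- toLong(b)) yield {
      val diff = x ^ y
      // diff fits in 32 bits, so a 64-bit leading-zero count is always >= 32.
      if (diff == 0L) 32 else java.lang.Long.numberOfLeadingZeros(diff) - 32
    }

  /** Distance in [0, 1]: 0.0 for identical addresses, 1.0 if the first bit differs. */
  def distance(a: String, b: String): Option[Double] =
    commonPrefixBits(a, b).map(bits => 1.0 - bits / 32.0)
}
```

Under this definition `IpDistance.distance("10.0.0.1", "10.0.0.2")` is `Some(0.0625)`, and any unparsable input yields `None` so that the caller decides how to treat missing values.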
Big data doesn’t always mean we need to analyze petabytes of data; sometimes it means we can find just the right sample.
Model selection
• Naive Bayes (built in)
• Logistic Regression (built in)
• SVM (built in)
• Decision trees (built in)
• Locality sensitive hashing (https://github.com/mrsqueeze/spark-hash)
Transform from DataFrames to MLlib
LabeledPoint
Vectors.dense
Row
Schema categoricalFeature
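The transformation named on this slide can be sketched without Spark on the classpath. Everything below is a stand-in written for illustration: `Labeled` plays the role of MLlib's `LabeledPoint`, a plain `Map[String, String]` plays the role of a `Row`, and the feature array is what would be passed to `Vectors.dense`. Encoding each category as an index, with one extra index for unseen values, is an assumption of this sketch.

```scala
import scala.util.Try

// Stand-in for MLlib's LabeledPoint (label plus dense feature values).
final case class Labeled(label: Double, features: Array[Double])

// Hypothetical helper, not part of Spark.
object RowEncoder {
  /** Give each distinct category value a stable index (sorted order). */
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  /** Encode one row: categorical columns become indices, numeric columns pass through. */
  def encode(
      row: Map[String, String],
      labelCol: String,
      categorical: Seq[(String, Map[String, Int])],
      numeric: Seq[String]): Labeled = {
    val catFeatures = categorical.map { case (col, index) =>
      // Missing or unseen categories share one extra index, index.size.
      row.get(col).flatMap(index.get).map(_.toDouble).getOrElse(index.size.toDouble)
    }
    // Missing or unparsable numeric values default to 0.0 (a choice made for this sketch).
    val numFeatures = numeric.map { col =>
      row.get(col).flatMap(v => Try(v.toDouble).toOption).getOrElse(0.0)
    }
    // Throws if the label column is absent or not numeric.
    Labeled(row(labelCol).toDouble, (catFeatures ++ numFeatures).toArray)
  }
}
```

With an index built from the values `samsung` and `apple`, a row whose `brand` is `samsung` and whose `ip_distance` is `0.25` encodes to the features `[1.0, 0.25]`. MLlib's decision tree trainers take a `categoricalFeaturesInfo` map from feature index to number of categories, which is why an index encoding is used here rather than one-hot columns.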
Model evaluation
Torture the data enough and it will confess to anything
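One common guard against a model that merely "confesses" is to score it on held-out examples and look at more than one metric. The slides do not show how evaluation was done, so this is a minimal sketch with hypothetical names (`Confusion`, `Evaluation`) that counts outcomes from (predicted, actual) pairs for a binary label such as `is_match`.

```scala
// Counts of true/false positives and negatives for a binary classifier.
final case class Confusion(tp: Long, fp: Long, tn: Long, fn: Long) {
  private def ratio(num: Long, den: Long): Double =
    if (den == 0) 0.0 else num.toDouble / den

  def precision: Double = ratio(tp, tp + fp)
  def recall: Double = ratio(tp, tp + fn)
  def accuracy: Double = ratio(tp + tn, tp + fp + tn + fn)
  def f1: Double = {
    val (p, r) = (precision, recall)
    if (p + r == 0.0) 0.0 else 2 * p * r / (p + r)
  }
}

object Evaluation {
  /** Fold (predicted, actual) pairs into a confusion matrix. */
  def confusion(pairs: Seq[(Boolean, Boolean)]): Confusion =
    pairs.foldLeft(Confusion(0, 0, 0, 0)) {
      case (c, (true, true))   => c.copy(tp = c.tp + 1)
      case (c, (true, false))  => c.copy(fp = c.fp + 1)
      case (c, (false, false)) => c.copy(tn = c.tn + 1)
      case (c, (false, true))  => c.copy(fn = c.fn + 1)
    }
}
```

Reporting precision and recall alongside accuracy matters when matches are rare, since a model that never predicts a match can still look accurate.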
Takeaways
• Big data is not just about big data
• Getting insights: it’s a process
• Spark is great but can drive you crazy :)
Summary
• Understand the problem
• Data exploration
• Feature selection (and building)
• (ETLing)
• Model selection
• Model evaluation
We’re hiring…
jobs@appsflyer.com

SAP Build Work Zone - Overview L2-L3.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Distilling insights @ AppsFlyer

  • 3. Data’s hierarchy of needs* *With apologies to Maslow
  • 4. What is AppsFlyer? Mobile attribution measurement and analytics
  • 7. Architecture diagram (labels): Kafka; Columnar Database (Redshift, evaluating Vertica); Secor; Aggregations; SparkSQL (evaluating Drill, Presto); SQL; Raw (sequence files); DW (parquet files); DM (Aggregations); Vishnu; Self-serve BI (TBD); Spark; Spark ML; Latest Events; Scoring; Blueshift; Mojito; installs, clicks, in-app, launches; Spark ETL; Accounts; Application dashboard; Latest event exploration
  • 13. Get the data from the big data lake
  • 14. Or locate it somehow in the big data swamp…
  • 16. SparkSQL is a nice tool to find relevant data
        """
          |select
          |m.raw_device_params.brand,
          |m.raw_device_params.model,
          |m.raw_device_params.lang,
          |m.raw_device_params.carrier,
          |m.raw_device_params.network,
          |m.raw_device_params.currency,
          |m.geo_info.city as m_city,
          |m.geo_info.country_code as m_country,
          |m.geo_info.region as m_region,
          |m.device.os_version as m_os,
          |m.device.language as m_lang,
          |m.timestamp as m_timestamp,
          |m.ip as m_ip,
          |c.ip as c_ip,
          |c.event_time as c_timestamp,
          |c.original_url as c_url,
          |c.user_agent as c_ua,
          |c.client_cookie as c_cookie,
          |c.country as c_country,
          |c.region as c_region,
          |c.city as c_city,
          |c.language as c_lang,
          |1 as is_match
          |from c join m on (m.app_id = c.app_id
          |  and m.attribution.transaction_id = c.transaction_id
          |  and m.attribution.`match-type` = 'ref')
        """.stripMargin

        root
         |-- action_context: string (nullable = true)
         |-- action_name: string (nullable = true)
         |-- action_type: string (nullable = true)
         |-- alg-check-timestamp: struct (nullable = true)
         |    |-- in: string (nullable = true)
         |    |-- out: string (nullable = true)
         |-- api_version: string (nullable = true)
         |-- app_id: string (nullable = true)
         |-- app_name: string (nullable = true)
         |-- attribution: struct (nullable = true)
         |    |-- action_context: string (nullable = true)
         |    |-- action_name: string (nullable = true)
         |    |-- action_type: string (nullable = true)
         |    |-- additional_data: struct (nullable = true)
         |    |    |-- C50%New: string (nullable = true)
         |    |    |-- Idfa: string (nullable = true)
         |    |    |-- LineItemId: string (nullable = true)
         |    |    |-- MID: string (nullable = true)
         |    |    |-- PRD: string (nullable = true)
         |    |    |-- PublisherId: string (nullable = true)
         |    |    |-- SD: string (nullable = true)
         |    |    |-- SSO: string (nullable = true)
         |    |    |-- SetUid: string (nullable = true)
         |    |    |-- Source: string (nullable = true)
         |    |    |-- SourceId: string (nullable = true)
         |    |    |-- UID: string (nullable = true)
         |    |    |-- UUID: string (nullable = true)
         |    |    |-- W-all-25-60-audiobook: string (nullable = true)
         |    |    |-- _: string (nullable = true)
         |    |    |-- a76852453141ced: string (nullable = true)
         |    |    |-- actionid: string (nullable = true)
         |    |    |-- ad_sub1: string (nullable = true)
         |    |    |-- adid: string (nullable = true)
         |    |    |-- advertising_id: string (nullable = true)
         |    |    |-- advertising_id : string (nullable = true)
         |    |    |-- adxclkid: string (nullable = true)
         |    |    |-- af: string (nullable = true)
         |    |    |-- af_cid: string (nullable = true)
         |    |    |-- af_cpi: string (nullable = true)
         |    |    |-- af_dp: string (nullable = true)
         |    |    |-- af_google_channel: string (nullable = true)
         |    |    |-- af_id: string (nullable = true)
         |    |    |-- af_installpostback: string (nullable = true)
         |    |    |-- af_prt: string (nullable = true)
  • 19. UDFs to generate features
  • 20. What’s the distance between two IP addresses?
  • 21. Big data doesn’t always mean we need to analyze petabytes of data; sometimes it means we can find just the right sample
  • 22. Model selection: Naive Bayes (built in); Logistic Regression (built in); SVM (built in); Decision trees (built in); Locality sensitive hashing (https://github.com/mrsqueeze/spark-hash)
  • 23. Transform from DataFrames to MLlib: LabeledPoint, Vectors.dense, Row, Schema, categoricalFeature
  • 25. Torture the data enough and it will confess to anything
  • 26. Takeaways: Big data is not just about big data; Getting insights is a process; Spark is great but can drive you crazy :)
  • 27. Summary: Understand the problem; Data exploration; Feature selection (and building); (ETLing); Model selection; Model evaluation
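
Slides 19 and 20 mention UDFs that generate features and ask what the distance between two IP addresses is, but the transcript does not say how that distance was defined. The sketch below is one possible answer, offered as an assumption rather than the talk's method: it treats two IPv4 addresses as closer the longer the bit prefix they share. The names `IpDistance`, `ipToLong`, `commonPrefixBits` and `distance` are hypothetical.

```scala
import scala.util.Try

object IpDistance {
  /** Parse a dotted-quad IPv4 string into an unsigned 32-bit value held in a Long. */
  def ipToLong(ip: String): Option[Long] = {
    val parts = ip.trim.split('.').toList
    // Keep only parts that parse as integers in the 0..255 range.
    val octets = parts.flatMap(p => Try(p.toInt).toOption).filter(o => o >= 0 && o <= 255)
    if (parts.length != 4 || octets.length != 4) None
    else Some(octets.foldLeft(0L)((acc, o) => (acc << 8) | o))
  }

  /** Number of leading bits the two 32-bit addresses have in common (0 to 32). */
  def commonPrefixBits(a: Long, b: Long): Int = {
    val diff = a ^ b
    // The values occupy the low 32 bits of a Long, so subtract the 32 unused high bits.
    if (diff == 0L) 32 else java.lang.Long.numberOfLeadingZeros(diff) - 32
  }

  /** Assumed distance: 0 for identical addresses, 32 when the very first bit differs. */
  def distance(ipA: String, ipB: String): Option[Int] =
    for {
      a <- ipToLong(ipA)
      b <- ipToLong(ipB)
    } yield 32 - commonPrefixBits(a, b)
}
```

Because `distance` is a plain function, it could be wrapped with Spark's `udf` helper from `org.apache.spark.sql.functions` and applied to columns such as `m_ip` and `c_ip` from the query on slide 16. Returning `Option` keeps malformed addresses from throwing inside a job.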
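
Slide 23 lists the pieces involved in moving from DataFrame rows to MLlib input (LabeledPoint, Vectors.dense, Row, Schema, categoricalFeature) without showing the conversion. The sketch below illustrates the underlying idea in plain Scala, with no Spark dependency: categorical strings are mapped to integer indices, each record becomes a label plus a dense array of doubles, and the number of categories per feature position is recorded. The `Record` fields and all function names are hypothetical, chosen only to resemble the columns in the query on slide 16.

```scala
object FeatureEncoding {
  /** One joined click/install record; the field names are illustrative only. */
  final case class Record(country: String, carrier: String, ipDistance: Double, isMatch: Boolean)

  /** Give each distinct category a stable index, sorted so runs are reproducible. */
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  /** Encode a record as (label, features); None if a category was never indexed. */
  def encode(
      r: Record,
      countryIdx: Map[String, Int],
      carrierIdx: Map[String, Int]): Option[(Double, Array[Double])] =
    for {
      country <- countryIdx.get(r.country)
      carrier <- carrierIdx.get(r.carrier)
    } yield {
      val label = if (r.isMatch) 1.0 else 0.0
      (label, Array(country.toDouble, carrier.toDouble, r.ipDistance))
    }

  /** Feature position -> number of categories, for the two categorical columns. */
  def categoricalFeaturesInfo(
      countryIdx: Map[String, Int],
      carrierIdx: Map[String, Int]): Map[Int, Int] =
    Map(0 -> countryIdx.size, 1 -> carrierIdx.size)
}
```

With Spark on the classpath, each encoded pair could then be wrapped as `LabeledPoint(label, Vectors.dense(features))`, and a `Map[Int, Int]` of this shape is what MLlib's decision tree trainers accept as `categoricalFeaturesInfo`. Dropping records with unseen categories is a design choice of this sketch; a real pipeline might reserve an index for unknown values instead.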