SlideShare a Scribd company logo
1 of 31
© 2015 IBM Corporation
Predict whether an online ad will be
clicked using Spark MLlib pipelines
Presenter: Manisha Sule, IBM Spark Enablement
© 2015 IBM Corporation2
Outline
 Online advertising and click through rate prediction
 Data set and features
 Spark MLlib and the Pipeline API
 MLlib pipeline for Click Through Rate Prediction
 Questions
Online Advertising
© 2015 IBM Corporation4
Online advertising
© 2015 IBM Corporation5
Online advertising on search engines
© 2015 IBM Corporation6
Social Media advertising
 Facebook's "Sponsored Stories"
 LinkedIn's "Sponsored Updates"
 Twitter's "Promoted Tweets"
© 2015 IBM Corporation7
Online advertising is big business
© 2015 IBM Corporation8
Types of online advertising
 Retargeting – Using cookies, track if a user left a webpage without making a purchase and
retarget the user with ads from that site
 Behavioral targeting – Data related to user’s online activity is collected from multiple
websites, thus creating a detailed picture of the user’s interests to deliver more targeted
advertising
 Contextual advertising – Display ads related to the content of the webpage
 Geo targeting – Ads are presented based on the user’s suspected geography
© 2015 IBM Corporation9
Online Advertising – The Players
Publishers:
 Earn revenue by displaying ads on their sites
 Google, Wall Street Journal, Twitter
Advertisers:
 Pay for their ads to be displayed on publisher sites
 Goal is to increase business via advertising
Matchmakers:
 Match publishers with advertisers
 Interaction with advertisers and publishers occurs real time
© 2015 IBM Corporation10
Revenue models from online advertising
 CPM (Cost-Per-Mille): is an inventory based pricing model. Is when the price is based on
1,000 impressions.
 CPC (Cost-Per-Click): Is a performance-based metric. This means the Publisher only gets
paid when (and if) a user clicks on an ad
 CPA (Cost Per Action): Best deal of all for Advertisers in terms of risk because they only
pay for media when it results in a sale
© 2015 IBM Corporation11
Click-through rate (CTR)
 Ratio of users who click on an ad to the number of total users who view the ad
 Used to measure the success of an online advertising campaign
 Very first online ad shown for AT&T on website HotWired has a 44% Click-through rate
 Today, typical click through rate is less than 1%
 In a pay-per-click setting, revenue can be maximized by choosing to display ads that have
the maximum CTR, hence the problem of CTR Prediction.
© 2015 IBM Corporation12
Predict CTR – a Scalable Machine Learning success story
Predict conditional probability that the ad will be clicked by the user given the predictive
features of ads
Predictive features are:
 Ad’s historical performance
 Advertiser and ad content info
 Publisher info
 User Info (eg: search/ click history)
Data set is high dimensional, sparse and skewed:
 Hundreds of millions of online users
 Millions of unique publisher pages to display ads
 Millions of unique ads to display
 Very few ads get clicked by users
Dataset and features
© 2015 IBM Corporation14
About the dataset used
 Sample of the dataset used for the Display Advertising Challenge hosted by Kaggle:
https://www.kaggle.com/c/criteo-display-ad-challenge/
 Consists of a portion of Criteo's traffic over a period of 7 days.
© 2015 IBM Corporation15
Dataset Features
Data fields
 Label - Target variable that indicates if an ad was clicked (1) or not (0).
 I1-I13 - A total of 13 columns of integer features (mostly count features).
 C1-C26 - A total of 26 columns of categorical features. The values of these features have
been hashed onto 32 bits for anonymization purposes.
Spark MLlib and Pipeline API
© 2015 IBM Corporation17
Spark MLlib
 Spark’s machine learning (ML) library.
 Goal is to make ML scalable and easy
 Includes common learning algorithms and utilities, like classification, regression, clustering,
collaborative filtering, dimensionality reduction
 Includes lower-level optimization primitives and higher level pipeline APIs
 Divided into two packages
• Spark.mllib: original API built on top of RDDs
• Spark.ml: provides higher level API built on top of DataFrames for constructing ML
pipelines
© 2015 IBM Corporation18
ML pipelines for complex ML workflows
 DataFrame: Equivalent to a relational table in Spark SQL
 Transformer: An algorithm that transforms one DataFrame into another
 Estimator: An algorithm that can be fit on a DataFrame to produce a Transformer
 Pipeline: Chains multiple transformers and estimators to
produce a ML workflow
 Parameter: Common API for specifying parameters
MLlib Pipeline for Click Through Rate
Prediction
© 2015 IBM Corporation20
Overview of ML pipeline for Click Through Rate Prediction
© 2015 IBM Corporation21
Step 1 - Parse Data into Spark SQL DataFrames
 Spark SQL converts a RDD of Row objects into a DataFrame
 Rows are constructed by passing key/value pairs where keys define the column names of
the table
 The data types are inferred by looking at the data
© 2015 IBM Corporation22
Step 2 – Feature Transformer using StringIndexer
 Encodes a string column to a column of indices
 Indices are from 0 to max number of distinct string values, ordered by frequencies
 If column is numeric, cast it to string and index string values
© 2015 IBM Corporation23
Step 3 - Feature Transformer using One Hot Encoding
 Maps a column of label indices to a column of binary vectors
 Allows algorithms which expect continuous features, to use categorical features
© 2015 IBM Corporation24
Step 4 – Feature Selector using Vector Assembler
 Transformer that combines a given list of columns into a single vector column
 Useful for combining raw features and features generated by different feature transformers
into a single feature vector
© 2015 IBM Corporation25
Step 5 – Train a model using Estimator Logistic Regression
© 2015 IBM Corporation26
Apply the pipeline to make predictions
Demo
© 2015 IBM Corporation28
Parameter Tuning using CrossValidator or TrainValidationSplit
© 2015 IBM Corporation29
Some alternatives:
 Use Hashed features instead of OHE
 Use Log loss evaluation or ROC to evaluate Logistic Regression
 Perform feature selection
 Use Naïve Bayes or other binary classification algorithms
© 2015 IBM Corporation30
Thank you!
© 2015 IBM Corporation31
References:
 Wikipedia: https://en.wikipedia.org/wiki/Online_advertising
 BerkeleyX: CS190.1x Scalable Machine Learning (https://courses.edx.org/)
 Big Data University: Spark Fundamentals 1 (https://bigdatauniversity.com/courses/spark-
fundamentals/)
 Big Data University: Spark Fundamentals 2 (http://bigdatauniversity.com/courses/spark-
fundamentals-ii/)
 Learning Spark: Lightning-Fast Big Data Analysis (Book by Andy Konwinski, Holden Karau,
and Patrick Wendell)
 Advanced Analytics with Spark: Patterns for Learning from Data at Scale (Book by Josh
Wills, Sandy Ryza, Sean Owen, and Uri Laserson)
 Spark Programming Guide (http://spark.apache.org/docs/latest/programming-guide.html)

More Related Content

What's hot

Multi-tenant Database Design for SaaS
Multi-tenant Database Design for SaaSMulti-tenant Database Design for SaaS
Multi-tenant Database Design for SaaSVõ Duy Tuấn
 
Crafting a Digital Marketing Plan
Crafting a Digital Marketing PlanCrafting a Digital Marketing Plan
Crafting a Digital Marketing PlanCatherine Quiambao
 
Homepage Personalization at Spotify
Homepage Personalization at SpotifyHomepage Personalization at Spotify
Homepage Personalization at SpotifyOguz Semerci
 
Uber eats Competitive Analysis
Uber eats Competitive AnalysisUber eats Competitive Analysis
Uber eats Competitive AnalysisRitu Jain
 
Metrics, Engagement & Personalization
Metrics, Engagement & Personalization Metrics, Engagement & Personalization
Metrics, Engagement & Personalization Mounia Lalmas-Roelleke
 
Introduction to seo keyword and content strategy
Introduction to seo keyword and content strategyIntroduction to seo keyword and content strategy
Introduction to seo keyword and content strategyROI Logic
 
Digital Marketing Strategy
Digital Marketing Strategy Digital Marketing Strategy
Digital Marketing Strategy Sabareesan A
 
Digital Marketing KPIs Measurement - By Waqas Lakhani
Digital Marketing KPIs Measurement - By Waqas LakhaniDigital Marketing KPIs Measurement - By Waqas Lakhani
Digital Marketing KPIs Measurement - By Waqas LakhaniWaqas Lakhani
 
digital-marketing-proposal.pdf
digital-marketing-proposal.pdfdigital-marketing-proposal.pdf
digital-marketing-proposal.pdfChandan738306
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...MLconf
 
Pitch deck - Plataforma Delivery It - Cardápio digital, Marketing, Pedidos on...
Pitch deck - Plataforma Delivery It - Cardápio digital, Marketing, Pedidos on...Pitch deck - Plataforma Delivery It - Cardápio digital, Marketing, Pedidos on...
Pitch deck - Plataforma Delivery It - Cardápio digital, Marketing, Pedidos on...Agência Dix
 
Alibaba ppt
Alibaba pptAlibaba ppt
Alibaba pptcskjin
 
content marketing - 2023 -
content marketing - 2023 - content marketing - 2023 -
content marketing - 2023 - Lamiaa Ahmed
 
How Does Organic SEO Work?
How Does Organic SEO Work?How Does Organic SEO Work?
How Does Organic SEO Work?ITS-SEO.com
 
OSMC 2023 | Large-scale logging made easy by Alexandr Valialkin
OSMC 2023 | Large-scale logging made easy by Alexandr ValialkinOSMC 2023 | Large-scale logging made easy by Alexandr Valialkin
OSMC 2023 | Large-scale logging made easy by Alexandr ValialkinNETWAYS
 
Awesome e commerce-shopify
Awesome e commerce-shopifyAwesome e commerce-shopify
Awesome e commerce-shopifyMichael Trang
 
Mailchimp presentation
Mailchimp presentationMailchimp presentation
Mailchimp presentationVAExpert Shane
 

What's hot (20)

Multi-tenant Database Design for SaaS
Multi-tenant Database Design for SaaSMulti-tenant Database Design for SaaS
Multi-tenant Database Design for SaaS
 
Crafting a Digital Marketing Plan
Crafting a Digital Marketing PlanCrafting a Digital Marketing Plan
Crafting a Digital Marketing Plan
 
Homepage Personalization at Spotify
Homepage Personalization at SpotifyHomepage Personalization at Spotify
Homepage Personalization at Spotify
 
Uber eats Competitive Analysis
Uber eats Competitive AnalysisUber eats Competitive Analysis
Uber eats Competitive Analysis
 
Metrics, Engagement & Personalization
Metrics, Engagement & Personalization Metrics, Engagement & Personalization
Metrics, Engagement & Personalization
 
Introduction to seo keyword and content strategy
Introduction to seo keyword and content strategyIntroduction to seo keyword and content strategy
Introduction to seo keyword and content strategy
 
Digital Marketing Strategy
Digital Marketing Strategy Digital Marketing Strategy
Digital Marketing Strategy
 
Digital Marketing KPIs Measurement - By Waqas Lakhani
Digital Marketing KPIs Measurement - By Waqas LakhaniDigital Marketing KPIs Measurement - By Waqas Lakhani
Digital Marketing KPIs Measurement - By Waqas Lakhani
 
Digital Marketing
Digital MarketingDigital Marketing
Digital Marketing
 
digital-marketing-proposal.pdf
digital-marketing-proposal.pdfdigital-marketing-proposal.pdf
digital-marketing-proposal.pdf
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Facebook Ads
Facebook AdsFacebook Ads
Facebook Ads
 
Pitch deck - Plataforma Delivery It - Cardápio digital, Marketing, Pedidos on...
Pitch deck - Plataforma Delivery It - Cardápio digital, Marketing, Pedidos on...Pitch deck - Plataforma Delivery It - Cardápio digital, Marketing, Pedidos on...
Pitch deck - Plataforma Delivery It - Cardápio digital, Marketing, Pedidos on...
 
Alibaba ppt
Alibaba pptAlibaba ppt
Alibaba ppt
 
content marketing - 2023 -
content marketing - 2023 - content marketing - 2023 -
content marketing - 2023 -
 
How Does Organic SEO Work?
How Does Organic SEO Work?How Does Organic SEO Work?
How Does Organic SEO Work?
 
Google AdWords
Google AdWordsGoogle AdWords
Google AdWords
 
OSMC 2023 | Large-scale logging made easy by Alexandr Valialkin
OSMC 2023 | Large-scale logging made easy by Alexandr ValialkinOSMC 2023 | Large-scale logging made easy by Alexandr Valialkin
OSMC 2023 | Large-scale logging made easy by Alexandr Valialkin
 
Awesome e commerce-shopify
Awesome e commerce-shopifyAwesome e commerce-shopify
Awesome e commerce-shopify
 
Mailchimp presentation
Mailchimp presentationMailchimp presentation
Mailchimp presentation
 

Similar to CTR Prediction using Spark Machine Learning Pipelines

Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...Databricks
 
Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516
Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516
Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516Tanjina Prema
 
API and Platform Strategies to Win in Global and Local Markets
API and Platform Strategies to Win in Global and Local MarketsAPI and Platform Strategies to Win in Global and Local Markets
API and Platform Strategies to Win in Global and Local MarketsAxway
 
Iag api management architect presentation
Iag   api management architect presentationIag   api management architect presentation
Iag api management architect presentationsflynn073
 
Building SaaS on AWS Serverless
Building SaaS on AWS ServerlessBuilding SaaS on AWS Serverless
Building SaaS on AWS ServerlessHolger Reinhardt
 
Building a SaaS on AWS Serverless
Building a SaaS on AWS ServerlessBuilding a SaaS on AWS Serverless
Building a SaaS on AWS ServerlessAdello
 
20141209 meetup hassan
20141209 meetup hassan20141209 meetup hassan
20141209 meetup hassanNanda Kishore
 
Using Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh Sistu
Using Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh SistuUsing Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh Sistu
Using Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh SistuHostedbyConfluent
 
Era of APIs: Why do we need an API strategy?
Era of APIs: Why do we need an API strategy?Era of APIs: Why do we need an API strategy?
Era of APIs: Why do we need an API strategy?Bala Iyer
 
IBM APM for Hybrid Applications
IBM APM for Hybrid ApplicationsIBM APM for Hybrid Applications
IBM APM for Hybrid ApplicationsMatthew Cheah
 
API Economy: 2016 Horizonwatch Trend Brief
API Economy:  2016 Horizonwatch Trend BriefAPI Economy:  2016 Horizonwatch Trend Brief
API Economy: 2016 Horizonwatch Trend BriefBill Chamberlin
 
CompeteWorkshop_ALLSLIDESv2
CompeteWorkshop_ALLSLIDESv2CompeteWorkshop_ALLSLIDESv2
CompeteWorkshop_ALLSLIDESv2Adam Hecht
 
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...Databricks
 
Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...
Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...
Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...MLconf
 
How to Execute a Successful API Strategy
How to Execute a Successful API StrategyHow to Execute a Successful API Strategy
How to Execute a Successful API StrategyMatt McLarty
 
UK Integration WebSphere User Group - MultiSpeed IT
UK Integration WebSphere User Group - MultiSpeed ITUK Integration WebSphere User Group - MultiSpeed IT
UK Integration WebSphere User Group - MultiSpeed ITAndyHumphreys
 
Manage your ap is securely and easily ibm apim 4.0
Manage your ap is securely and easily ibm apim 4.0Manage your ap is securely and easily ibm apim 4.0
Manage your ap is securely and easily ibm apim 4.0sflynn073
 
API Management
API ManagementAPI Management
API ManagementProlifics
 

Similar to CTR Prediction using Spark Machine Learning Pipelines (20)

Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
 
Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516
Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516
Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516
 
API and Platform Strategies to Win in Global and Local Markets
API and Platform Strategies to Win in Global and Local MarketsAPI and Platform Strategies to Win in Global and Local Markets
API and Platform Strategies to Win in Global and Local Markets
 
Iag api management architect presentation
Iag   api management architect presentationIag   api management architect presentation
Iag api management architect presentation
 
Building SaaS on AWS Serverless
Building SaaS on AWS ServerlessBuilding SaaS on AWS Serverless
Building SaaS on AWS Serverless
 
Building a SaaS on AWS Serverless
Building a SaaS on AWS ServerlessBuilding a SaaS on AWS Serverless
Building a SaaS on AWS Serverless
 
20141209 meetup hassan
20141209 meetup hassan20141209 meetup hassan
20141209 meetup hassan
 
Using Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh Sistu
Using Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh SistuUsing Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh Sistu
Using Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh Sistu
 
Era of APIs: Why do we need an API strategy?
Era of APIs: Why do we need an API strategy?Era of APIs: Why do we need an API strategy?
Era of APIs: Why do we need an API strategy?
 
AdvertOne
AdvertOneAdvertOne
AdvertOne
 
Ad exchange product description
Ad exchange product descriptionAd exchange product description
Ad exchange product description
 
IBM APM for Hybrid Applications
IBM APM for Hybrid ApplicationsIBM APM for Hybrid Applications
IBM APM for Hybrid Applications
 
API Economy: 2016 Horizonwatch Trend Brief
API Economy:  2016 Horizonwatch Trend BriefAPI Economy:  2016 Horizonwatch Trend Brief
API Economy: 2016 Horizonwatch Trend Brief
 
CompeteWorkshop_ALLSLIDESv2
CompeteWorkshop_ALLSLIDESv2CompeteWorkshop_ALLSLIDESv2
CompeteWorkshop_ALLSLIDESv2
 
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...
 
Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...
Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...
Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...
 
How to Execute a Successful API Strategy
How to Execute a Successful API StrategyHow to Execute a Successful API Strategy
How to Execute a Successful API Strategy
 
UK Integration WebSphere User Group - MultiSpeed IT
UK Integration WebSphere User Group - MultiSpeed ITUK Integration WebSphere User Group - MultiSpeed IT
UK Integration WebSphere User Group - MultiSpeed IT
 
Manage your ap is securely and easily ibm apim 4.0
Manage your ap is securely and easily ibm apim 4.0Manage your ap is securely and easily ibm apim 4.0
Manage your ap is securely and easily ibm apim 4.0
 
API Management
API ManagementAPI Management
API Management
 

Recently uploaded

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 

CTR Prediction using Spark Machine Learning Pipelines

  • 1. © 2015 IBM Corporation Predict whether an online ad will be clicked using Spark MLlib pipelines Presenter: Manisha Sule, IBM Spark Enablement
  • 2. © 2015 IBM Corporation2 Outline  Online advertising and click through rate prediction  Data set and features  Spark MLlib and the Pipeline API  MLlib pipeline for Click Through Rate Prediction  Questions
  • 4. © 2015 IBM Corporation4 Online advertising
  • 5. © 2015 IBM Corporation5 Online advertising on search engines
  • 6. © 2015 IBM Corporation6 Social Media advertising  Facebook's "Sponsored Stories"  LinkedIn's "Sponsored Updates"  Twitter's "Promoted Tweets"
  • 7. © 2015 IBM Corporation7 Online advertising is big business
  • 8. © 2015 IBM Corporation8 Types of online advertising  Retargeting – Using cookies, track if a user left a webpage without making a purchase and retarget the user with ads from that site  Behavioral targeting – Data related to user’s online activity is collected from multiple websites, thus creating a detailed picture of the user’s interests to deliver more targeted advertising  Contextual advertising – Display ads related to the content of the webpage  Geo targeting – Ads are presented based on the user’s suspected geography
  • 9. © 2015 IBM Corporation9 Online Advertising – The Players Publishers:  Earn revenue by displaying ads on their sites  Google, Wall Street Journal, Twitter Advertisers:  Pay for their ads to be displayed on publisher sites  Goal is to increase business via advertising Matchmakers:  Match publishers with advertisers  Interaction with advertisers and publishers occurs real time
  • 10. © 2015 IBM Corporation10 Revenue models from online advertising  CPM (Cost-Per-Mille): is an inventory based pricing model. Is when the price is based on 1,000 impressions.  CPC (Cost-Per-Click): Is a performance-based metric. This means the Publisher only gets paid when (and if) a user clicks on an ad  CPA (Cost Per Action): Best deal of all for Advertisers in terms of risk because they only pay for media when it results in a sale
  • 11. © 2015 IBM Corporation11 Click-through rate (CTR)  Ratio of users who click on an ad to the number of total users who view the ad  Used to measure the success of an online advertising campaign  Very first online ad shown for AT&T on website HotWired has a 44% Click-through rate  Today, typical click through rate is less than 1%  In a pay-per-click setting, revenue can be maximized by choosing to display ads that have the maximum CTR, hence the problem of CTR Prediction.
  • 12. © 2015 IBM Corporation12 Predict CTR – a Scalable Machine Learning success story Predict conditional probability that the ad will be clicked by the user given the predictive features of ads Predictive features are:  Ad’s historical performance  Advertiser and ad content info  Publisher info  User Info (eg: search/ click history) Data set is high dimensional, sparse and skewed:  Hundreds of millions of online users  Millions of unique publisher pages to display ads  Millions of unique ads to display  Very few ads get clicked by users
  • 14. © 2015 IBM Corporation14 About the dataset used  Sample of the dataset used for the Display Advertising Challenge hosted by Kaggle: https://www.kaggle.com/c/criteo-display-ad-challenge/  Consists of a portion of Criteo's traffic over a period of 7 days.
  • 15. © 2015 IBM Corporation15 Dataset Features Data fields  Label - Target variable that indicates if an ad was clicked (1) or not (0).  I1-I13 - A total of 13 columns of integer features (mostly count features).  C1-C26 - A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.
  • 16. Spark MLlib and Pipeline API
  • 17. © 2015 IBM Corporation17 Spark MLlib  Spark’s machine learning (ML) library.  Goal is to make ML scalable and easy  Includes common learning algorithms and utilities, like classification, regression, clustering, collaborative filtering, dimensionality reduction  Includes lower-level optimization primitives and higher level pipeline APIs  Divided into two packages • Spark.mllib: original API built on top of RDDs • Spark.ml: provides higher level API built on top of DataFrames for constructing ML pipelines
  • 18. © 2015 IBM Corporation18 ML pipelines for complex ML workflows  DataFrame: Equivalent to a relational table in Spark SQL  Transformer: An algorithm that transforms one DataFrame into another  Estimator: An algorithm that can be fit on a DataFrame to produce a Transformer  Pipeline: Chains multiple transformers and estimators to produce a ML workflow  Parameter: Common API for specifying parameters
  • 19. MLlib Pipeline for Click Through Rate Prediction
  • 20. © 2015 IBM Corporation20 Overview of ML pipeline for Click Through Rate Prediction
  • 21. © 2015 IBM Corporation21 Step 1 - Parse Data into Spark SQL DataFrames  Spark SQL converts a RDD of Row objects into a DataFrame  Rows are constructed by passing key/value pairs where keys define the column names of the table  The data types are inferred by looking at the data
  • 22. © 2015 IBM Corporation22 Step 2 – Feature Transformer using StringIndexer  Encodes a string column to a column of indices  Indices are from 0 to max number of distinct string values, ordered by frequencies  If column is numeric, cast it to string and index string values
  • 23. © 2015 IBM Corporation23 Step 3 - Feature Transformer using One Hot Encoding  Maps a column of label indices to a column of binary vectors  Allows algorithms which expect continuous features, to use categorical features
  • 24. © 2015 IBM Corporation24 Step 4 – Feature Selector using Vector Assembler  Transformer that combines a given list of columns into a single vector column  Useful for combining raw features and features generated by different feature transformers into a single feature vector
  • 25. © 2015 IBM Corporation25 Step 5 – Train a model using Estimator Logistic Regression
  • 26. © 2015 IBM Corporation26 Apply the pipeline to make predictions
  • 27. Demo
  • 28. © 2015 IBM Corporation28 Parameter Tuning using CrossValidator or TrainValidationSplit
  • 29. © 2015 IBM Corporation29 Some alternatives:  Use Hashed features instead of OHE  Use Log loss evaluation or ROC to evaluate Logistic Regression  Perform feature selection  Use Naïve Bayes or other binary classification algorithms
  • 30. © 2015 IBM Corporation30 Thank you!
  • 31. © 2015 IBM Corporation31 References:  Wikipedia: https://en.wikipedia.org/wiki/Online_advertising  BerkeleyX: CS190.1x Scalable Machine Learning (https://courses.edx.org/)  Big Data University: Spark Fundamentals 1 (https://bigdatauniversity.com/courses/spark- fundamentals/)  Big Data University: Spark Fundamentals 2 (http://bigdatauniversity.com/courses/spark- fundamentals-ii/)  Learning Spark: Lightning-Fast Big Data Analysis (Book by Andy Konwinski, Holden Karau, and Patrick Wendell)  Advanced Analytics with Spark: Patterns for Learning from Data at Scale (Book by Josh Wills, Sandy Ryza, Sean Owen, and Uri Laserson)  Spark Programming Guide (http://spark.apache.org/docs/latest/programming-guide.html)

Editor's Notes

  1. 1