SlideShare a Scribd company logo
1 of 16
Download to read offline
Building an ML tool to predict
article quality scores using Delta
& MLflow
Ivana Pejeva
Data Scientist @ element61
What’s the challenge
For past articles, Roularta wants to…
▪ analyze which article did well in
bringing traffic, engagement &
conversions
Quality scoring of an article Predicting quality of an article
Roularta wants to know the aricle quality...
... but there is an infinite amount of possible KPIs one
needs to track
...and the goal is to simplify/standardize/automate an
objective measurement
For new articles, Roularta wants to…
▪ predict which article will bring good
conversions, traffic and engagement
What we offer editorial teams
▪ Calculates article scores on historical articles
▪ Predicts article scores on new articles
CMS
Content Behavior
Reader behavior
Structured streaming
Structured streaming
Azure Data Lake Gen2
Feature extraction
Predictive Model
Article Score
Score calculation
Data tool which
What do we want to predict
How much traffic is an
article bringing to the
site?
Is this article going to
bring good
conversions?
Will this article keep
people engaged?
Traffic Score Conversion Score Engagement Score
The flow for calculating and predicting article
scores
Gather all pageviews
and content data
• Number of pageviews
• Pageviews from Social
media
• Article author
• Publishing date
• …
Calculate series of
traffic, engagement
and conversion scores
Traffic Score: 0.7
Engagement Score:
0.2
Conversion Score: 0.5
Calculate quality
stars for each
article
Traffic Score:
Engagement Score:
Conversion Score:
Calculate quality
score
Calculate overall
quality score based on
traffic, engagement
and conversion score
Predict scores
Predict the traffic,
engagement and
conversion scores on
new articles
What’s the data
▪ Reading Behavior
▪ Coming from a CDP tracker
▪ Pageviews
▪ Content Behavior
▪ Coming from the CMS
▪ Details on the content
CMS
Content Management System
How we prepared the data
▪ Number of pageviews
▪ Time on page
▪ Social media shares
▪ Registrations
▪ Subscriptions
▪ Bounce rate
▪ …
▪ Article text
▪ Publication time
▪ Author
▪ Tags in article
▪ How long is the article
▪ Topic of article
▪ …
Calculate article scores Predict article scores
Calculation of article scores is a computationally
intensive job
▪ A lot of the metrics require looking
at a specific window of data from
millions of rows
▪ e.g. calculating the number of visitors
to an article where the user was not
seen on the site for the past 30 days
▪ This would require to always look
at a specific window of data for
each article
▪ Very expensive operation as we
are looking over millions of rows
▪ Keep intermediate delta lake
tables
▪ e.g. Keep the visitors from the last 30
days in an intermediate delta table
▪ Avoid using windowing function over
a huge amount of data
▪ > 10x improve in performance
The role of Delta
▪ ~2 million pageviews per day we need to process in streaming mode
▪ ~ 250k number of articles
▪ 50 brands
▪ Delta in silver and gold layer for data ingestion
▪ Intermediate delta tables for extracted features
▪ Time travel to recreate ML experiments
Delta to accelerate data ingestion and feature engineering pipelines
The feature extraction process
CMS
Content Behavior
Reader behavior
Calculate traffic,
engagement and
conversion scores
Create features for
each article Add labels to articles
• Is it a short/long article?
• Does the teaser have images?
• Tags in article
• Authors of article
• When was it published?
• Teaser text
NLP BERT language model
▪ BERTje – a dutch based BERT model
▪ Extract representations from article text
▪ Leverage pandasUDF for better performance
▪ Highly improved ML model performance
For extracting features from articles text
How we build the predictive model
CMS
Content Behavior
Reader behavior
Calculate traffic,
engagement and
conversion scores
Create features
for each article
Add labels to
articles
• Is it a short/long article?
• Does the teaser have images?
• Tags in article
• Authors of article
• When was it published?
• Teaser text
ML Pipeline for
feature
transformation
• Binarize features
• Encode string column to label indices
• TF-IDF to reflects importance of words
in tags, topics and collections of articles
• Scale data with mean and std
• Create feature vectors
Train Model
• Multi-class classification problem
(predict the number of stars (1-5)
• Random Forest Classifier
• Cross Validation and log model
metrics with
• Multiclass Classification Evaluator
Register
model
The role of MLflow
▪ Train ML models per brand and per score
▪ We need to
▪ Track performance for all models
▪ Promote new model versions fast
▪ Serve models to score new data
For model tracking, register and serve
Results
Method:
▪ Predictions made on articles from one brand
▪ Total number of articles 2368
Results:
▪ In 93% of the cases a quality article (4 or 5 stars) is
given 4 or 5 stars by the model
▪ A low-quality article (1 or 2 stars) is given 4 or 5 stars in
only 2% of the cases
Details:
▪ Most important features are Teaser text and topics
Summary
▪ Use Delta to accelerate data ingestion and feature extraction
▪ Use NLP BERT for text representations from articles
▪ Use Mlflow for tracking, registering and serving ML models
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

More Related Content

What's hot

Session-based recommendations with recurrent neural networks
Session-based recommendations with recurrent neural networksSession-based recommendations with recurrent neural networks
Session-based recommendations with recurrent neural networksZimin Park
 
[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data
[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data
[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive DataSumit Rangwala
 
System design for recommendations and search
System design for recommendations and searchSystem design for recommendations and search
System design for recommendations and searchEugene Yan Ziyou
 
Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsBenjamin Le
 
Learning to rank
Learning to rankLearning to rank
Learning to rankBruce Kuo
 
Recent Trends in Personalization: A Netflix Perspective
Recent Trends in Personalization: A Netflix PerspectiveRecent Trends in Personalization: A Netflix Perspective
Recent Trends in Personalization: A Netflix PerspectiveJustin Basilico
 
시계열 분석의 이해와 활용
시계열 분석의 이해와 활용시계열 분석의 이해와 활용
시계열 분석의 이해와 활용Seung-Woo Kang
 
Causality without headaches
Causality without headachesCausality without headaches
Causality without headachesBenoît Rostykus
 
Beyond Churn Prediction : An Introduction to uplift modeling
Beyond Churn Prediction : An Introduction to uplift modelingBeyond Churn Prediction : An Introduction to uplift modeling
Beyond Churn Prediction : An Introduction to uplift modelingPierre Gutierrez
 
Counterfactual Learning for Recommendation
Counterfactual Learning for RecommendationCounterfactual Learning for Recommendation
Counterfactual Learning for RecommendationOlivier Jeunen
 
Introduction to Recommendation Systems
Introduction to Recommendation SystemsIntroduction to Recommendation Systems
Introduction to Recommendation SystemsTrieu Nguyen
 
Warsaw Data Science - Factorization Machines Introduction
Warsaw Data Science -  Factorization Machines IntroductionWarsaw Data Science -  Factorization Machines Introduction
Warsaw Data Science - Factorization Machines IntroductionBartlomiej Twardowski
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender SystemsJustin Basilico
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
 
Incorporating Diversity in a Learning to Rank Recommender System
Incorporating Diversity in a Learning to Rank Recommender SystemIncorporating Diversity in a Learning to Rank Recommender System
Incorporating Diversity in a Learning to Rank Recommender SystemJacek Wasilewski
 
Practicing Data Science: A Collection of Case Studies
Practicing Data Science: A Collection of Case StudiesPracticing Data Science: A Collection of Case Studies
Practicing Data Science: A Collection of Case StudiesKNIMESlides
 
Modeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systemsModeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systemsMitul Tiwari
 
Recommendation system
Recommendation system Recommendation system
Recommendation system Vikrant Arya
 

What's hot (20)

Session-based recommendations with recurrent neural networks
Session-based recommendations with recurrent neural networksSession-based recommendations with recurrent neural networks
Session-based recommendations with recurrent neural networks
 
[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data
[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data
[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data
 
System design for recommendations and search
System design for recommendations and searchSystem design for recommendations and search
System design for recommendations and search
 
Learn to Rank search results
Learn to Rank search resultsLearn to Rank search results
Learn to Rank search results
 
Matrix Factorization
Matrix FactorizationMatrix Factorization
Matrix Factorization
 
Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender Systems
 
Learning to rank
Learning to rankLearning to rank
Learning to rank
 
Recent Trends in Personalization: A Netflix Perspective
Recent Trends in Personalization: A Netflix PerspectiveRecent Trends in Personalization: A Netflix Perspective
Recent Trends in Personalization: A Netflix Perspective
 
시계열 분석의 이해와 활용
시계열 분석의 이해와 활용시계열 분석의 이해와 활용
시계열 분석의 이해와 활용
 
Causality without headaches
Causality without headachesCausality without headaches
Causality without headaches
 
Beyond Churn Prediction : An Introduction to uplift modeling
Beyond Churn Prediction : An Introduction to uplift modelingBeyond Churn Prediction : An Introduction to uplift modeling
Beyond Churn Prediction : An Introduction to uplift modeling
 
Counterfactual Learning for Recommendation
Counterfactual Learning for RecommendationCounterfactual Learning for Recommendation
Counterfactual Learning for Recommendation
 
Introduction to Recommendation Systems
Introduction to Recommendation SystemsIntroduction to Recommendation Systems
Introduction to Recommendation Systems
 
Warsaw Data Science - Factorization Machines Introduction
Warsaw Data Science -  Factorization Machines IntroductionWarsaw Data Science -  Factorization Machines Introduction
Warsaw Data Science - Factorization Machines Introduction
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
 
Incorporating Diversity in a Learning to Rank Recommender System
Incorporating Diversity in a Learning to Rank Recommender SystemIncorporating Diversity in a Learning to Rank Recommender System
Incorporating Diversity in a Learning to Rank Recommender System
 
Practicing Data Science: A Collection of Case Studies
Practicing Data Science: A Collection of Case StudiesPracticing Data Science: A Collection of Case Studies
Practicing Data Science: A Collection of Case Studies
 
Modeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systemsModeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systems
 
Recommendation system
Recommendation system Recommendation system
Recommendation system
 

Similar to Building an ML Tool to predict Article Quality Scores using Delta & MLFlow

Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
Best Practices with Sitecore
Best Practices with SitecoreBest Practices with Sitecore
Best Practices with SitecoreAnant Corporation
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabszekeLabs Technologies
 
Oracle PIM: Phantasmal Item Descriptions in your Organization
Oracle PIM: Phantasmal Item Descriptions in your OrganizationOracle PIM: Phantasmal Item Descriptions in your Organization
Oracle PIM: Phantasmal Item Descriptions in your OrganizationAXIA Consulting Inc.
 
Structured Authoring for Business-Critical Content
Structured Authoring for Business-Critical ContentStructured Authoring for Business-Critical Content
Structured Authoring for Business-Critical ContentLavaCon
 
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile ApproachUsing OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile ApproachKent Graziano
 
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher ScientificEnabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher ScientificDatabricks
 
Building modern intranets with share point communication sites aug 2018 kloud
Building modern intranets with share point communication sites aug 2018   kloudBuilding modern intranets with share point communication sites aug 2018   kloud
Building modern intranets with share point communication sites aug 2018 kloudAsish Padhy
 
Out With the Old, in With the Open-source: Brainshark's Complete CMS Migration
Out With the Old, in With the Open-source: Brainshark's Complete CMS MigrationOut With the Old, in With the Open-source: Brainshark's Complete CMS Migration
Out With the Old, in With the Open-source: Brainshark's Complete CMS MigrationAcquia
 
Modernize Your Content Publishing Process with Smart Content
Modernize Your Content Publishing Process with Smart ContentModernize Your Content Publishing Process with Smart Content
Modernize Your Content Publishing Process with Smart ContentGavin Drake
 
Advanced Model Comparison and Automated Deployment Using ML
Advanced Model Comparison and Automated Deployment Using MLAdvanced Model Comparison and Automated Deployment Using ML
Advanced Model Comparison and Automated Deployment Using MLDatabricks
 
Prospectus Editing Tool (PET)
Prospectus Editing Tool (PET)Prospectus Editing Tool (PET)
Prospectus Editing Tool (PET)MrJ1971
 
SEF2013 - Create a Business Solution, Step by Step, with No Managed Code
SEF2013 - Create a Business Solution, Step by Step, with No Managed CodeSEF2013 - Create a Business Solution, Step by Step, with No Managed Code
SEF2013 - Create a Business Solution, Step by Step, with No Managed CodeMarc D Anderson
 
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...OpenSource Connections
 
Moving advanced analytics to your sql server databases
Moving advanced analytics to your sql server databasesMoving advanced analytics to your sql server databases
Moving advanced analytics to your sql server databasesEnrico van de Laar
 
Shaking hands with the developer: How IT Communications can help you build a ...
Shaking hands with the developer: How IT Communications can help you build a ...Shaking hands with the developer: How IT Communications can help you build a ...
Shaking hands with the developer: How IT Communications can help you build a ...Sarah Khan
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesDataWorks Summit
 

Similar to Building an ML Tool to predict Article Quality Scores using Delta & MLFlow (20)

Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Best Practices with Sitecore
Best Practices with SitecoreBest Practices with Sitecore
Best Practices with Sitecore
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
NLP and the Web
NLP and the WebNLP and the Web
NLP and the Web
 
Oracle PIM: Phantasmal Item Descriptions in your Organization
Oracle PIM: Phantasmal Item Descriptions in your OrganizationOracle PIM: Phantasmal Item Descriptions in your Organization
Oracle PIM: Phantasmal Item Descriptions in your Organization
 
Structured Authoring for Business-Critical Content
Structured Authoring for Business-Critical ContentStructured Authoring for Business-Critical Content
Structured Authoring for Business-Critical Content
 
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile ApproachUsing OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach
 
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher ScientificEnabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific
 
Building modern intranets with share point communication sites aug 2018 kloud
Building modern intranets with share point communication sites aug 2018   kloudBuilding modern intranets with share point communication sites aug 2018   kloud
Building modern intranets with share point communication sites aug 2018 kloud
 
Webinar: The Slippery Slope of Migrating to SharePoint Online or On-Premise
Webinar: The Slippery Slope of Migrating to SharePoint Online or On-PremiseWebinar: The Slippery Slope of Migrating to SharePoint Online or On-Premise
Webinar: The Slippery Slope of Migrating to SharePoint Online or On-Premise
 
Out With the Old, in With the Open-source: Brainshark's Complete CMS Migration
Out With the Old, in With the Open-source: Brainshark's Complete CMS MigrationOut With the Old, in With the Open-source: Brainshark's Complete CMS Migration
Out With the Old, in With the Open-source: Brainshark's Complete CMS Migration
 
Modernize Your Content Publishing Process with Smart Content
Modernize Your Content Publishing Process with Smart ContentModernize Your Content Publishing Process with Smart Content
Modernize Your Content Publishing Process with Smart Content
 
Advanced Model Comparison and Automated Deployment Using ML
Advanced Model Comparison and Automated Deployment Using MLAdvanced Model Comparison and Automated Deployment Using ML
Advanced Model Comparison and Automated Deployment Using ML
 
Prospectus Editing Tool (PET)
Prospectus Editing Tool (PET)Prospectus Editing Tool (PET)
Prospectus Editing Tool (PET)
 
SEF2013 - Create a Business Solution, Step by Step, with No Managed Code
SEF2013 - Create a Business Solution, Step by Step, with No Managed CodeSEF2013 - Create a Business Solution, Step by Step, with No Managed Code
SEF2013 - Create a Business Solution, Step by Step, with No Managed Code
 
SharePoint Custom Development
SharePoint Custom DevelopmentSharePoint Custom Development
SharePoint Custom Development
 
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...
 
Moving advanced analytics to your sql server databases
Moving advanced analytics to your sql server databasesMoving advanced analytics to your sql server databases
Moving advanced analytics to your sql server databases
 
Shaking hands with the developer: How IT Communications can help you build a ...
Shaking hands with the developer: How IT Communications can help you build a ...Shaking hands with the developer: How IT Communications can help you build a ...
Shaking hands with the developer: How IT Communications can help you build a ...
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 

Recently uploaded (20)

VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 

Building an ML Tool to predict Article Quality Scores using Delta & MLFlow

  • 1. Building an ML tool to predict article quality scores using Delta & MLflow Ivana Pejeva Data Scientist @ element61
  • 2. What’s the challenge For past articles, Roularta wants to… ▪ analyze which article did well in bringing traffic, engagement & conversions Quality scoring of an article Predicting quality of an article Roularta wants to know the aricle quality... ... but there is an infinite amount of possible KPIs one needs to track ...and the goal is to simplify/standardize/automate an objective measurement For new articles, Roularta wants to… ▪ predict which article will bring good conversions, traffic and engagement
  • 3. What we offer editorial teams ▪ Calculates article scores on historical articles ▪ Predicts article scores on new articles CMS Content Behavior Reader behavior Structured streaming Structured streaming Azure Data Lake Gen2 Feature extraction Predictive Model Article Score Score calculation Data tool which
  • 4. What do we want to predict How much traffic is an article bringing to the site? Is this article going to bring good conversions? Will this article keep people engaged? Traffic Score Conversion Score Engagement Score
  • 5. The flow for calculating and predicting article scores Gather all pageviews and content data • Number of pageviews • Pageviews from Social media • Article author • Publishing date • … Calculate series of traffic, engagement and conversion scores Traffic Score: 0.7 Engagement Score: 0.2 Conversion Score: 0.5 Calculate quality stars for each article Traffic Score: Engagement Score: Conversion Score: Calculate quality score Calculate overall quality score based on traffic, engagement and conversion score Predict scores Predict the traffic, engagement and conversion scores on new articles
  • 6. What’s the data ▪ Reading Behavior ▪ Coming from a CDP tracker ▪ Pageviews ▪ Content Behavior ▪ Coming from the CMS ▪ Details on the content CMS Content Management System
  • 7. How we prepared the data ▪ Number of pageviews ▪ Time on page ▪ Social media shares ▪ Registrations ▪ Subscriptions ▪ Bounce rate ▪ … ▪ Article text ▪ Publication time ▪ Author ▪ Tags in article ▪ How long is the article ▪ Topic of article ▪ … Calculate article scores Predict article scores
  • 8. Calculation of article scores is a computationally intensive job ▪ A lot of the metrics require looking at a specific window of data from millions of rows ▪ e.g. calculating the number of visitors to an article where the user was not seen on the site for the past 30 days ▪ This would require to always look at a specific window of data for each article ▪ Very expensive operation as we are looking over millions of rows ▪ Keep intermediate delta lake tables ▪ e.g. Keep the visitors from the last 30 days in an intermediate delta table ▪ Avoid using windowing function over a huge amount of data ▪ > 10x improve in performance
  • 9. The role of Delta ▪ ~2 million pageviews per day we need to process in streaming mode ▪ ~ 250k number of articles ▪ 50 brands ▪ Delta in silver and gold layer for data ingestion ▪ Intermediate delta tables for extracted features ▪ Time travel to recreate ML experiments Delta to accelerate data ingestion and feature engineering pipelines
  • 10. The feature extraction process CMS Content Behavior Reader behavior Calculate traffic, engagement and conversion scores Create features for each article Add labels to articles • Is it a short/long article? • Does the teaser have images? • Tags in article • Authors of article • When was it published? • Teaser text
  • 11. NLP BERT language model ▪ BERTje – a dutch based BERT model ▪ Extract representations from article text ▪ Leverage pandasUDF for better performance ▪ Highly improved ML model performance For extracting features from articles text
  • 12. How we build the predictive model CMS Content Behavior Reader behavior Calculate traffic, engagement and conversion scores Create features for each article Add labels to articles • Is it a short/long article? • Does the teaser have images? • Tags in article • Authors of article • When was it published? • Teaser text ML Pipeline for feature transformation • Binarize features • Encode string column to label indices • TF-IDF to reflects importance of words in tags, topics and collections of articles • Scale data with mean and std • Create feature vectors Train Model • Multi-class classification problem (predict the number of stars (1-5) • Random Forest Classifier • Cross Validation and log model metrics with • Multiclass Classification Evaluator Register model
  • 13. The role of MLflow ▪ Train ML models per brand and per score ▪ We need to ▪ Track performance for all models ▪ Promote new model versions fast ▪ Serve models to score new data For model tracking, register and serve
  • 14. Results Method: ▪ Predictions made on articles from one brand ▪ Total number of articles 2368 Results: ▪ In 93% of the cases a quality article (4 or 5 stars) is given 4 or 5 stars by the model ▪ A low-quality article (1 or 2 stars) is given 4 or 5 stars in only 2% of the cases Details: ▪ Most important features are Teaser text and topics
  • 15. Summary ▪ Use Delta to accelerate data ingestion and feature extraction ▪ Use NLP BERT for text representations from articles ▪ Use Mlflow for tracking, registering and serving ML models
  • 16. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.