Building an ML Tool to predict Article Quality Scores using Delta & MLflow

Ivana Pejeva
Data Scientist @ element61

What’s the challenge
Quality scoring of an article
For past articles, Roularta wants to…
▪ analyze which articles did well in bringing traffic, engagement & conversions
Predicting quality of an article
For new articles, Roularta wants to…
▪ predict which articles will bring good conversions, traffic and engagement
Roularta wants to know the article quality...
... but there is a practically unlimited number of possible KPIs one could track
...and the goal is to simplify/standardize/automate an objective measurement

What we offer editorial teams
A data tool which
▪ Calculates article scores on historical articles
▪ Predicts article scores on new articles
Diagram components: CMS (content behavior), reader behavior, Structured Streaming, Azure Data Lake Gen2, score calculation, feature extraction, predictive model, article score

What do we want to predict
▪ Traffic Score: How much traffic is an article bringing to the site?
▪ Conversion Score: Is this article going to bring good conversions?
▪ Engagement Score: Will this article keep people engaged?

The flow for calculating and predicting article scores
▪ Gather all pageviews and content data
• Number of pageviews
• Pageviews from social media
• Article author
• Publishing date
• …
▪ Calculate a series of traffic, engagement and conversion scores
• Traffic Score: 0.7
• Engagement Score: 0.2
• Conversion Score: 0.5
▪ Calculate quality stars for each article
• One star rating each for the Traffic Score, Engagement Score and Conversion Score (star graphics not preserved)
▪ Calculate quality score
• Calculate the overall quality score based on the traffic, engagement and conversion scores
▪ Predict scores
• Predict the traffic, engagement and conversion scores on new articles
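
The deck does not say how scores become stars or how the overall score is aggregated. The sketch below is only an illustration under two assumptions that are not from the source: scores lie in [0, 1] and are cut into five equal-width bins, and the overall quality score is an unweighted mean. The function names are hypothetical.

```python
from statistics import mean


def score_to_stars(score: float, n_stars: int = 5) -> int:
    """Map a score in [0, 1] to 1..n_stars using equal-width bins (assumed rule)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    # [0, 0.2) -> 1 star, [0.2, 0.4) -> 2 stars, ...; a score of 1.0 is
    # clamped into the top bin.
    return min(int(score * n_stars) + 1, n_stars)


def overall_quality(scores: dict) -> float:
    """Unweighted mean of the component scores (assumed aggregation)."""
    return mean(scores.values())


# Example values taken from the flow above.
article = {"traffic": 0.7, "engagement": 0.2, "conversion": 0.5}
stars = {name: score_to_stars(value) for name, value in article.items()}
overall = overall_quality(article)
```

With a different binning rule (for example per-brand quantiles) only `score_to_stars` would change; the rest of the flow stays the same.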

What’s the data
▪ Reading behavior
▪ Coming from a CDP tracker
▪ Pageviews
▪ Content behavior
▪ Coming from the CMS (Content Management System)
▪ Details on the content

How we prepared the data
Calculate article scores
▪ Number of pageviews
▪ Time on page
▪ Social media shares
▪ Registrations
▪ Subscriptions
▪ Bounce rate
▪ …
Predict article scores
▪ Article text
▪ Publication time
▪ Author
▪ Tags in article
▪ Article length
▪ Topic of article
▪ …

Calculation of article scores is a computationally intensive job
▪ Many of the metrics require looking at a specific window of data across millions of rows
▪ e.g. calculating the number of visitors to an article where the user was not seen on the site for the past 30 days
▪ This would require always looking at a specific window of data for each article
▪ A very expensive operation, as we are looking over millions of rows
▪ Keep intermediate Delta Lake tables
▪ e.g. keep the visitors from the last 30 days in an intermediate Delta table
▪ Avoid using window functions over a huge amount of data
▪ > 10x improvement in performance
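
The idea of replacing a window scan with an intermediate table can be sketched in plain Python. This is not the authors' Spark/Delta code: the dictionary `last_seen` is a stand-in for the intermediate Delta table, and the function name, tuple layout and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

LOOKBACK = timedelta(days=30)


def count_new_visitors(pageviews, last_seen=None):
    """Count, per article, visits by users not seen on the site in the past 30 days.

    pageviews: iterable of (day, visitor_id, article_id), sorted by day.
    last_seen: stand-in for the intermediate table {visitor_id: last visit day},
               carried between batches instead of re-reading raw pageviews.
    """
    last_seen = {} if last_seen is None else last_seen
    counts = defaultdict(int)
    for day, visitor, article in pageviews:
        previous = last_seen.get(visitor)
        # One lookup per pageview replaces a scan over the visitor's history.
        if previous is None or day - previous > LOOKBACK:
            counts[article] += 1
        last_seen[visitor] = day
    return dict(counts), last_seen


batch_1 = [
    (date(2021, 1, 1), "u1", "a1"),
    (date(2021, 1, 2), "u2", "a1"),
    (date(2021, 1, 10), "u1", "a2"),  # u1 was seen 9 days earlier: not counted
]
batch_2 = [
    (date(2021, 2, 15), "u1", "a2"),  # u1 last seen 36 days earlier: counted
    (date(2021, 2, 15), "u2", "a2"),  # u2 last seen 44 days earlier: counted
]

counts_1, state = count_new_visitors(batch_1)
counts_2, state = count_new_visitors(batch_2, state)
```

Each new batch only touches the small state table, so the cost no longer grows with the size of the raw pageview history.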

The role of Delta
Delta to accelerate data ingestion and feature engineering pipelines
▪ ~2 million pageviews per day that we need to process in streaming mode
▪ ~250k articles
▪ 50 brands
▪ Delta in the silver and gold layers for data ingestion
▪ Intermediate Delta tables for extracted features
▪ Time travel to recreate ML experiments

The feature extraction process
▪ Sources: CMS content behavior and reader behavior
▪ Calculate traffic, engagement and conversion scores
▪ Create features for each article
• Is it a short/long article?
• Does the teaser have images?
• Tags in article
• Authors of article
• When was it published?
• Teaser text
▪ Add labels to articles

NLP BERT language model
For extracting features from article text
▪ BERTje, a Dutch BERT model
▪ Extract representations from article text
▪ Leverage pandas UDFs for better performance
▪ Highly improved ML model performance
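
The deck does not say how the token-level output of BERTje is turned into one vector per article. One common option, shown here purely as an assumption, is masked mean pooling: average the token vectors while ignoring padding. The tiny 3-dimensional vectors are hypothetical stand-ins for real model output.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, giving one fixed-size vector."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Hypothetical token vectors; a BERT-base model would produce 768 dimensions.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]  # the last position is padding and is ignored

article_vector = mean_pool(tokens, mask)
```

The resulting fixed-size vector can then be used as one more feature column next to the tag, author and length features.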

How we build the predictive model
▪ Sources: CMS content behavior and reader behavior
▪ Calculate traffic, engagement and conversion scores
▪ Create features for each article
• Is it a short/long article?
• Does the teaser have images?
• Tags in article
• Authors of article
• When was it published?
• Teaser text
▪ Add labels to articles
▪ ML Pipeline for feature transformation
• Binarize features
• Encode string columns to label indices
• TF-IDF to reflect the importance of words in tags, topics and collections of articles
• Scale data with mean and standard deviation
• Create feature vectors
▪ Train Model
• Multi-class classification problem (predict the number of stars, 1-5)
• Random Forest Classifier
• Cross-validation and logging of model metrics with MLflow
• Multiclass Classification Evaluator
▪ Register model
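
Two of the feature transformations above, TF-IDF and mean/std scaling, can be illustrated in plain Python. This is a sketch, not the pipeline from the talk: the tags and word counts are hypothetical, the IDF uses the smoothed form log((N + 1) / (df + 1)), and the scaling uses the population standard deviation, which may differ from what the original pipeline used.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def tf_idf(docs):
    """TF-IDF weights per document, with idf = log((N + 1) / (df + 1))."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def standardize(values):
    """Scale a numeric column to zero mean and unit (population) std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


# Hypothetical tags and word counts for three articles.
tags = [["politics", "belgium"], ["sport", "belgium"], ["sport", "cycling"]]
word_counts = [300, 600, 900]

weights = tf_idf(tags)           # rarer tags get a higher weight
scaled = standardize(word_counts)
```

In the first article the tag "politics" appears in one document and "belgium" in two, so "politics" receives the larger weight, which is the effect TF-IDF is meant to capture.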

The role of MLflow
For model tracking, registration and serving
▪ Train ML models per brand and per score
▪ We need to
▪ Track performance for all models
▪ Promote new model versions fast
▪ Serve models to score new data
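
To see why tracking matters, it helps to count the models. The sketch below assumes that each of the 50 brands mentioned earlier gets one model for each of the three scores; the deck does not confirm that every brand is covered. The naming scheme and brand identifiers are hypothetical, and no MLflow calls are used.

```python
from itertools import product

SCORES = ["traffic", "engagement", "conversion"]


def model_name(brand: str, score: str) -> str:
    """Hypothetical naming scheme: one registered model per brand and score."""
    return f"article_quality__{brand}__{score}"


# Placeholder brand identifiers; the deck mentions 50 brands but does not list them.
brands = [f"brand_{i:02d}" for i in range(1, 51)]
names = [model_name(brand, score) for brand, score in product(brands, SCORES)]
n_models = len(names)
```

Under these assumptions there are 150 models to retrain, compare and promote, which is hard to manage without a tracking server and a model registry.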

Results
Method:
▪ Predictions made on articles from one brand
▪ Total number of articles: 2,368
Results:
▪ In 93% of the cases a quality article (4 or 5 stars) is
given 4 or 5 stars by the model
▪ A low-quality article (1 or 2 stars) is given 4 or 5 stars in
only 2% of the cases
Details:
▪ The most important features are teaser text and topics
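
The two percentages above are shares of a group of true star ratings that the model places in the 4-5 star range. A minimal sketch of that calculation follows; the ratings are hypothetical and are not the results from the deck, and the function name is an assumption.

```python
def share_predicted_high(actual, predicted, actual_group, high=(4, 5)):
    """Share of articles with true stars in actual_group whose prediction is in `high`."""
    in_group = [p for a, p in zip(actual, predicted) if a in actual_group]
    if not in_group:
        raise ValueError("no articles in the requested group")
    return sum(p in high for p in in_group) / len(in_group)


# Hypothetical true and predicted star ratings for ten articles.
actual = [5, 4, 4, 5, 1, 2, 1, 2, 3, 3]
predicted = [5, 4, 3, 4, 1, 2, 4, 1, 3, 4]

# Quality articles (4 or 5 stars) that the model also rates 4 or 5.
quality_hit_rate = share_predicted_high(actual, predicted, actual_group={4, 5})
# Low-quality articles (1 or 2 stars) that the model wrongly rates 4 or 5.
low_quality_false_alarm = share_predicted_high(actual, predicted, actual_group={1, 2})
```

Reporting both numbers together shows how often good articles are recognized and how rarely poor ones are mistaken for good ones.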

Summary
▪ Use Delta to accelerate data ingestion and feature extraction
▪ Use NLP BERT for text representations from articles
▪ Use MLflow for tracking, registering and serving ML models

