For Roularta, a news and media publishing company, it is of great importance to understand reader behavior and which content attracts, engages and converts readers. At Roularta, we have built an AI-driven article quality scoring solution using Spark for parallelized compute, Delta for efficient data lake use, BERT for NLP and MLflow for model management. The solution is an NLP-based ML model that gives every published article a calculated and forecasted quality score based on three dimensions: conversion, traffic and engagement.
Building an ML Tool to predict Article Quality Scores using Delta & MLflow
1. Building an ML tool to predict article quality scores using Delta & MLflow
Ivana Pejeva
Data Scientist @ element61
2. What’s the challenge
Roularta wants to know the article quality...
... but there is an infinite number of possible KPIs one needs to track
... and the goal is to simplify, standardize and automate an objective measurement
Quality scoring of an article
For past articles, Roularta wants to…
▪ analyze which articles did well in bringing traffic, engagement and conversions
Predicting quality of an article
For new articles, Roularta wants to…
▪ predict which articles will bring good conversions, traffic and engagement
3. What we offer editorial teams
Data tool which
▪ Calculates article scores on historical articles
▪ Predicts article scores on new articles
Diagram components: CMS (content behavior), reader behavior, Structured Streaming, Azure Data Lake Gen2, score calculation, feature extraction, predictive model, article score
4. What do we want to predict
Traffic Score: How much traffic is an article bringing to the site?
Conversion Score: Is this article going to bring good conversions?
Engagement Score: Will this article keep people engaged?
5. The flow for calculating and predicting article scores
Gather all pageviews and content data
• Number of pageviews
• Pageviews from social media
• Article author
• Publishing date
• …
Calculate series of traffic, engagement and conversion scores
• Traffic Score: 0.7
• Engagement Score: 0.2
• Conversion Score: 0.5
Calculate quality stars for each article
• One star rating each for the Traffic Score, Engagement Score and Conversion Score
Calculate quality score
• Calculate overall quality score based on traffic, engagement and conversion score
Predict scores
• Predict the traffic, engagement and conversion scores on new articles
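The scoring steps in this flow can be sketched in a few lines of plain Python. The slides do not say how scores are turned into stars or how the overall score is weighted, so the equal-width star bins and the equal weights below are assumptions made for illustration only, and the names `ArticleScores`, `to_stars` and `overall_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticleScores:
    """The three per-article scores, each assumed to lie in [0, 1]."""
    traffic: float
    engagement: float
    conversion: float


def to_stars(score: float, n_stars: int = 5) -> int:
    """Map a score in [0, 1] to 1..n_stars using equal-width bins (assumption)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    # min() keeps a perfect score of 1.0 inside the top bin.
    return min(int(score * n_stars) + 1, n_stars)


def overall_score(scores: ArticleScores, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted average of the three scores; equal weights are an assumption."""
    w_traffic, w_engagement, w_conversion = weights
    total = w_traffic + w_engagement + w_conversion
    weighted = (
        w_traffic * scores.traffic
        + w_engagement * scores.engagement
        + w_conversion * scores.conversion
    )
    return weighted / total


# Example values taken from the slide: 0.7, 0.2 and 0.5.
example = ArticleScores(traffic=0.7, engagement=0.2, conversion=0.5)
example_overall = overall_score(example)
example_stars = to_stars(example_overall)
```

With different business priorities, the weights could be changed per brand without touching the star mapping.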
6. What’s the data
▪ Reader behavior
▪ Coming from a CDP tracker
▪ Pageviews
▪ Content behavior
▪ Coming from the CMS (Content Management System)
▪ Details on the content
7. How we prepared the data
Calculate article scores
▪ Number of pageviews
▪ Time on page
▪ Social media shares
▪ Registrations
▪ Subscriptions
▪ Bounce rate
▪ …
Predict article scores
▪ Article text
▪ Publication time
▪ Author
▪ Tags in article
▪ How long is the article
▪ Topic of article
▪ …
8. Calculation of article scores is a computationally intensive job
▪ Many of the metrics require looking at a specific window of data across millions of rows
▪ e.g. calculating the number of visitors to an article where the user was not seen on the site in the past 30 days
▪ This requires always looking at a specific window of data for each article
▪ A very expensive operation, as we are scanning millions of rows
▪ Keep intermediate Delta Lake tables
▪ e.g. keep the visitors from the last 30 days in an intermediate Delta table
▪ Avoid using window functions over a huge amount of data
▪ > 10x improvement in performance
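The idea of replacing a per-article window scan with an intermediate table can be illustrated in plain Python. This is only a sketch of the principle, not the Spark job itself: the `last_seen` dictionary stands in for the intermediate Delta table, the function name and tuple layout are hypothetical, and a visit exactly 30 days after the previous one is assumed to still count as "seen in the past 30 days".

```python
from collections import defaultdict
from datetime import date, timedelta


def count_new_visitors(pageviews, last_seen=None, absence=timedelta(days=30)):
    """Count, per article, visits by users not seen on the site within `absence`.

    pageviews: iterable of (visit_date, visitor_id, article_id), sorted by date.
    last_seen: dict visitor_id -> date of the latest visit; it plays the role
               of the intermediate table and is updated in place, so the next
               batch only needs this small state instead of the full history.
    """
    last_seen = {} if last_seen is None else last_seen
    counts = defaultdict(int)
    for visit_date, visitor_id, article_id in pageviews:
        previous = last_seen.get(visitor_id)
        # One lookup per pageview instead of a scan over a 30-day window.
        if previous is None or visit_date - previous > absence:
            counts[article_id] += 1
        last_seen[visitor_id] = visit_date
    return dict(counts)


# Made-up pageviews, purely for illustration.
state = {}
batch = [
    (date(2021, 1, 1), "u1", "a1"),  # first visit of u1
    (date(2021, 1, 5), "u1", "a2"),  # u1 seen 4 days earlier
    (date(2021, 1, 5), "u2", "a2"),  # first visit of u2
    (date(2021, 3, 1), "u1", "a2"),  # u1 absent for more than 30 days
]
new_visitors = count_new_visitors(batch, last_seen=state)
```

Because the state is carried between calls, a later batch can be processed with `count_new_visitors(next_batch, last_seen=state)` without revisiting older pageviews.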
9. The role of Delta
Delta to accelerate data ingestion and feature engineering pipelines
▪ ~2 million pageviews per day that we need to process in streaming mode
▪ ~250k articles
▪ 50 brands
▪ Delta in the silver and gold layers for data ingestion
▪ Intermediate Delta tables for extracted features
▪ Time travel to recreate ML experiments
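Time travel, as mentioned in the bullets above, can be used by reading a Delta table at a fixed version. A minimal sketch, assuming a SparkSession configured with Delta Lake is passed in; the function name, the path and the version number are hypothetical, while `versionAsOf` is the standard Delta reader option.

```python
def read_table_version(spark, path, version):
    """Load a Delta table as it existed at a given version number.

    spark:   a SparkSession with Delta Lake available (assumed, not created here).
    path:    location of the Delta table, e.g. a feature table (hypothetical).
    version: the table version to read, as listed in the table history.
    """
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(path)
    )
```

Recording the table version next to each training run is one way to reload exactly the same features when an experiment has to be recreated later.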
10. The feature extraction process
Inputs: CMS (content behavior) and reader behavior
Calculate traffic, engagement and conversion scores
Create features for each article
• Is it a short/long article?
• Does the teaser have images?
• Tags in article
• Authors of article
• When was it published?
• Teaser text
Add labels to articles
11. NLP BERT language model
For extracting features from article text
▪ BERTje: a Dutch BERT model
▪ Extract representations from article text
▪ Leverage pandas UDFs for better performance
▪ Highly improved ML model performance
12. How we build the predictive model
Inputs: CMS (content behavior) and reader behavior
Calculate traffic, engagement and conversion scores
Create features for each article
• Is it a short/long article?
• Does the teaser have images?
• Tags in article
• Authors of article
• When was it published?
• Teaser text
Add labels to articles
ML pipeline for feature transformation
• Binarize features
• Encode string columns to label indices
• TF-IDF to reflect the importance of words in tags, topics and collections of articles
• Scale data with mean and standard deviation
• Create feature vectors
Train model
• Multi-class classification problem (predict the number of stars, 1-5)
• Random Forest classifier
• Cross-validation and logging of model metrics
• Multiclass Classification Evaluator
Register model
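To make the TF-IDF step concrete, here is a plain-Python sketch of how article tags could be weighted. It is not the Spark ML pipeline from the slides; it only illustrates the technique, using the smoothed inverse document frequency log((N + 1) / (df + 1)) that Spark MLlib documents for its IDF estimator. The function name and the tag lists are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of token lists (e.g. tags per article).

    Returns one dict per document mapping term -> term frequency * IDF,
    with IDF = log((n_docs + 1) / (doc_freq + 1)).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    idf = {
        term: math.log((n_docs + 1) / (freq + 1))
        for term, freq in doc_freq.items()
    }
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in documents
    ]


# Hypothetical tags for three articles.
tags = [
    ["politics", "belgium"],
    ["sport", "belgium"],
    ["sport", "cycling", "belgium"],
]
weights = tf_idf(tags)
```

A tag that appears on every article, such as "belgium" here, gets a weight of zero, while rarer tags such as "politics" are weighted more heavily than the more common "sport".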
13. The role of MLflow
For model tracking, registration and serving
▪ Train ML models per brand and per score
▪ We need to
▪ Track performance for all models
▪ Promote new model versions fast
▪ Serve models to score new data
14. Results
Method:
▪ Predictions made on articles from one brand
▪ Total number of articles: 2,368
Results:
▪ In 93% of the cases, a quality article (4 or 5 stars) is given 4 or 5 stars by the model
▪ A low-quality article (1 or 2 stars) is given 4 or 5 stars in only 2% of the cases
Details:
▪ The most important features are teaser text and topics
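The two percentages reported above are shares of articles in a given actual star range that the model places at 4 or 5 stars. A minimal sketch of how such shares could be computed from actual and predicted stars; the function name is hypothetical and the star lists are made-up toy data, not the results from the slide.

```python
def share_predicted_high(actual, predicted, actual_range, high=(4, 5)):
    """Share of articles with actual stars in `actual_range` predicted as `high`.

    actual, predicted: equally long sequences of star ratings (1-5).
    Returns NaN when no article falls in `actual_range`.
    """
    selected = [p for a, p in zip(actual, predicted) if a in actual_range]
    if not selected:
        return float("nan")
    return sum(p in high for p in selected) / len(selected)


# Toy data for illustration only.
actual_stars = [5, 4, 4, 5, 1, 2, 1, 3]
predicted_stars = [5, 4, 3, 4, 1, 4, 2, 3]

# Quality articles (4 or 5 stars) that the model also rates 4 or 5 stars.
quality_hit_rate = share_predicted_high(actual_stars, predicted_stars, (4, 5))
# Low-quality articles (1 or 2 stars) that the model wrongly rates 4 or 5 stars.
low_quality_false_high = share_predicted_high(actual_stars, predicted_stars, (1, 2))
```

The first share corresponds to the 93% figure and the second to the 2% figure, each computed on the evaluated articles of one brand.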
15. Summary
▪ Use Delta to accelerate data ingestion and feature extraction
▪ Use NLP BERT for text representations from articles
▪ Use MLflow for tracking, registering and serving ML models