Deploying Data Science
for Distribution of The New York Times
Anne Bauer
anne.bauer@nytimes.com
Lead Data Scientist, NYTimes
PyData, 2018-10-17
Single copy newspaper distribution
1.  people still buy physical newspapers?
2.  algorithms
3.  experiments to test the algorithms
4.  ...we need to modify the algorithms
5.  app architecture
YES!
at ~ 47,000 stores!
(And you should too.)
How many papers should we
deliver to each store each day?
Too many or too few: a waste of $$, or missed sales!
“Single copy” optimization
Single copy: the process
Weekly process
•  Stores report sales from 1-2 weeks earlier (depending on the distributor)
•  We pick up the data via FTP and ingest them into our systems
•  Our models are retrained and predictions are run
•  Predictions are handed off via FTP to the circulation department
Turnaround time: ~a few hours
Single copy: the existing algorithm
Heuristics with many if/then statements
•  Highest sale over recent weeks × A + B
•  A, B are extremely hand-tuned by store type, location, ...
•  Interspersed amid 4600 lines of COBOL
•  Difficult to modify to include, e.g. print site cost differences
•  Quintessential time series modeling problem.
Perfect for data science!
Algorithm components
The problem is separable into two parts:
Prediction: Given previous sales, how many papers will sell
next Thursday?
Policy: We think N papers will sell, with a known
uncertainty distribution. How many should we send (draw)?
First pass:
AR(1)
X_t = c + φ X_{t-1} + ε_t
Daeil Kim
AR(1)
Prediction
•  X_t = c + φ X_{t-1} + ε_t
•  Today’s sales are a linear function of sales in the previous week(s)
•  One model per store per day of week
•  Use the past year’s data to fit for c, φ
•  AR(1) vs. AR(N) and training window chosen via cross-validation
Policy
•  Draw = ceil(demand)
•  Bump: if there have been recent sell-outs, send an extra copy
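The prediction and policy steps above can be sketched in a few lines. This is a minimal illustration, not the production code: the slides say the original used the statsmodels AR model, while this sketch swaps in a plain least-squares fit of c and φ so that it needs only the standard library. The function names and the default bump of one copy are assumptions made for the example.

```python
import math

def fit_ar1(series):
    """Least-squares fit of X_t = c + phi * X_{t-1} for one store and weekday."""
    x_prev, x_next = series[:-1], series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    var_prev = sum((p - mean_prev) ** 2 for p in x_prev)
    cov = sum((p - mean_prev) * (q - mean_next) for p, q in zip(x_prev, x_next))
    phi = cov / var_prev
    c = mean_next - phi * mean_prev
    return c, phi

def predict_next(series, c, phi):
    """One-step-ahead demand prediction from the most recent sale."""
    return c + phi * series[-1]

def draw_with_bump(demand, recent_sellouts, bump=1):
    """Policy: round demand up, and add `bump` copies after recent sell-outs.

    The size of the bump is an assumption; the slide only says "an extra".
    """
    draw = math.ceil(demand)
    if recent_sellouts:
        draw += bump
    return draw

# Toy series that follows X_t = 2 + 0.5 * X_{t-1} exactly.
sales = [10.0, 7.0, 5.5, 4.75, 4.375]
c, phi = fit_ar1(sales)
demand = predict_next(sales, c, phi)
```

One such model would be fitted per store and per day of week, with the training window chosen by cross-validation as described above.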
AR(1)
Implementation
•  Python 2, with statsmodels AR model. Single script.
•  Plots (matplotlib pngs) hosted using Flask to monitor draws & sales
•  Run by cron on a local server
•  No separate dev/prd environments; code “deployed” via scp
Second pass:
Poisson Regression
Dorian Goldman
Poisson Regression
Prediction
•  Today’s sale is a linear function of the previous
week(s) and the previous year
•  One model per store per day of week
•  Use the past year’s data to fit model
parameters
•  Feature time scales chosen via cross-validation
•  Assume the sales are drawn from a Poisson
distribution rather than Gaussian
•  Sell-outs considered in the likelihood function
Poisson Regression
b: # papers bought
d: # papers delivered (the draw)
z: demand (Poisson distributed latent variable)
λ: Poisson parameter for the demand distribution
Each store has a different λ each day. z for that store & day is drawn from a Poisson distribution with that λ.
Parameterize Poisson parameter λ as log-linear
function of features X.
θ are the parameters fitted in the problem via maximum likelihood (ML).
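The equation for this slide did not survive in the text. Under the usual reading of "log-linear", and using the symbols defined above (the transpose notation is an assumption), the model can be written as:

```latex
z \sim \mathrm{Poisson}(\lambda), \qquad
\log \lambda = \theta^{\top} X
\quad\Longleftrightarrow\quad
\lambda = \exp\!\left(\theta^{\top} X\right)
```

The exponential keeps λ positive for any value of θ, which is why the log link is the common choice for Poisson regression.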
Poisson Regression
The probability of the # bought given the demand depends on whether
the demand reached the number of papers delivered (i.e. whether there was a sell-out)
Use this probability for a maximum likelihood
estimation of the parameters θ that describe λ
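A sketch of what such a likelihood might look like, assuming the standard censored-Poisson form: when b < d the sale equals the demand and contributes the Poisson probability of b, and when b = d (a sell-out) we only know that z ≥ d, so the term is the tail probability P(z ≥ d). The slide's exact formula is not in the text, so this form, the function names, and the crude grid search are all assumptions; a real fit would use a numerical optimizer over θ.

```python
import math

def poisson_logpmf(k, lam):
    """log P(z = k) for z ~ Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def poisson_tail(k, lam):
    """P(z >= k) for z ~ Poisson(lam), floored to keep the log finite."""
    below = sum(math.exp(poisson_logpmf(j, lam)) for j in range(k))
    return max(1.0 - below, 1e-300)

def censored_loglik(theta, rows):
    """Log-likelihood of theta given rows of (features, bought, delivered)."""
    total = 0.0
    for x, b, d in rows:
        # Log-linear model: lambda = exp(theta . x)
        lam = math.exp(sum(t * xi for t, xi in zip(theta, x)))
        if b < d:
            total += poisson_logpmf(b, lam)          # demand observed exactly
        else:
            total += math.log(poisson_tail(d, lam))  # sell-out: demand >= d
    return total

# Toy data: one intercept feature, draw of 5, two sell-outs out of four days.
rows = [((1.0,), 3, 5), ((1.0,), 5, 5), ((1.0,), 4, 5), ((1.0,), 5, 5)]
candidates = [1.0 + 0.1 * i for i in range(141)]
best_lam = max(candidates, key=lambda lam: censored_loglik((math.log(lam),), rows))
naive_mean = sum(b for _, b, _ in rows) / len(rows)
```

On this toy data the censored estimate of λ comes out above the naive mean of the observed sales, which is the point of modeling sell-outs: observed sales understate demand whenever the store ran out.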
Poisson Regression
Policy: Newsvendor Algorithm
•  Profit = price × min(d, z) – cost × d
•  Take the derivative of the expected profit with respect to d and set it equal to zero; this implies:
Probability(z <= d) = (price-cost)/price
•  Optimal draw: smallest integer d such that
Probability(z <= d) >= (price-cost)/price
•  The probability is given by the CDF of the Poisson distribution whose
parameter is the demand prediction; find the best d by brute force.
z = demand
d = draw = # delivered
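The newsvendor rule above can be sketched as follows. This is an illustration using only the standard library; the function names and the toy prices are assumptions, and in practice scipy.stats.poisson provides the CDF directly.

```python
import math

def poisson_pmf(k, lam):
    """P(z = k) for z ~ Poisson(lam), computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_cdf(k, lam):
    """P(z <= k)."""
    return sum(poisson_pmf(j, lam) for j in range(k + 1))

def optimal_draw(lam, price, cost, max_draw=10_000):
    """Smallest d with P(z <= d) >= (price - cost) / price, by brute force."""
    critical_ratio = (price - cost) / price
    for d in range(max_draw + 1):
        if poisson_cdf(d, lam) >= critical_ratio:
            return d
    raise ValueError("critical ratio not reached below max_draw")

def expected_profit(d, lam, price, cost, tail=200):
    """E[price * min(d, z) - cost * d], truncating the demand sum at `tail`."""
    expected_sales = sum(min(d, z) * poisson_pmf(z, lam) for z in range(tail))
    return price * expected_sales - cost * d

print(optimal_draw(lam=4.0, price=3.0, cost=1.0))  # 5
```

The rule follows from the marginal gain of one more paper, price × Probability(z > d) – cost, which stays positive only while Probability(z <= d) is below (price-cost)/price. Comparing expected_profit across candidate draws gives the same answer as the CDF rule.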
Poisson Regression
Implementation: refactored code!
•  Models abstracted to sklearn-like classes to allow for easy future
expansion with plug & play model integration
•  Common library of functions to:
•  get data from the DB
•  calculate costs
•  check data quality
•  ...
•  __init__()
•  query()
•  transform()
•  fit()
•  predict()
•  policy()
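A possible shape for such a class is sketched below. Only the method names come from the slide; the signatures, the `run()` helper, the `DrawModel` and `MeanSalesModel` names, and the use of a dict in place of a database are assumptions made for illustration.

```python
import math
from abc import ABC, abstractmethod

class DrawModel(ABC):
    """Hypothetical base class; method names follow the slide, signatures are assumed."""

    def __init__(self, source):
        self.source = source  # stand-in for a DB connection or table name

    @abstractmethod
    def query(self):
        """Return raw sales records."""

    def transform(self, raw):
        """Turn raw records into model features (identity by default)."""
        return raw

    @abstractmethod
    def fit(self, features):
        """Estimate model parameters and return self."""

    @abstractmethod
    def predict(self):
        """Return predicted demand per store."""

    def policy(self, demand):
        """Map predicted demand to a draw (default: round up)."""
        return {store: math.ceil(z) for store, z in demand.items()}

    def run(self):
        """Convenience pipeline: query, transform, fit, predict, policy."""
        self.fit(self.transform(self.query()))
        return self.policy(self.predict())


class MeanSalesModel(DrawModel):
    """Toy plug-in model: predicted demand is each store's mean past sale."""

    def query(self):
        return self.source  # here the "database" is just a dict

    def fit(self, features):
        self.means_ = {s: sum(v) / len(v) for s, v in features.items()}
        return self

    def predict(self):
        return dict(self.means_)


draws = MeanSalesModel({"store_a": [3, 5, 4], "store_b": [10, 11]}).run()
```

With a shared interface like this, a new model only overrides the steps that differ, which is what makes the plug & play comparison of algorithms practical.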
Treatment & Control groups: match sales
Simple approach
•  Take a random sample that approximates the total sales distribution
•  For each member of this “treatment” sample, find closest match in mean sales
Trial & error checks!
•  Exclude cases with any large differences in sales during the training period
•  Only consider matches with the same production costs (~print site)
•  Make sure treatment & control sell the paper on the same weekdays
•  Better no match than a distant match
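The matching rules above could be sketched as a greedy nearest-neighbor search. The field names, the `max_gap` threshold, and matching without replacement are assumptions for the example; the check on large sales differences during the training period is omitted for brevity.

```python
def match_controls(treatment, candidates, max_gap):
    """Pair each treatment store with its closest eligible control store.

    Both arguments map store id -> {"mean_sales", "print_site", "weekdays"}.
    A store stays unmatched if no eligible control is within `max_gap`.
    """
    available = dict(candidates)
    matches = {}
    for t_id, t in treatment.items():
        eligible = [
            (abs(c["mean_sales"] - t["mean_sales"]), c_id)
            for c_id, c in available.items()
            if c["print_site"] == t["print_site"]   # same production costs
            and c["weekdays"] == t["weekdays"]      # same selling days
        ]
        if not eligible:
            continue
        gap, c_id = min(eligible)
        if gap <= max_gap:                          # better no match than a distant one
            matches[t_id] = c_id
            del available[c_id]                     # each control is used once
    return matches


days = frozenset({"Mon", "Thu"})
treatment = {
    "t1": {"mean_sales": 10.0, "print_site": "A", "weekdays": days},
    "t2": {"mean_sales": 50.0, "print_site": "B", "weekdays": days},
}
candidates = {
    "c1": {"mean_sales": 10.5, "print_site": "A", "weekdays": days},
    "c2": {"mean_sales": 10.1, "print_site": "B", "weekdays": days},
    "c3": {"mean_sales": 80.0, "print_site": "B", "weekdays": days},
}
pairs = match_controls(treatment, candidates, max_gap=2.0)
```

Here "t1" pairs with "c1", while "t2" is left unmatched because its closest same-site candidate is far outside the allowed gap.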
Reporting
D3 Dashboard
Optimize for profit: ✔
Make stakeholders happy: ✗
Our profit comes at the expense
of sales!
Sales matter beyond sales profit.
Circulation numbers matter.
Hard to quantify that value!
Goal: Optimize for profit
... but don’t decrease sales “too much”
∴ Constrained optimization
Constrained newsvendor algorithm
Policy: Newsvendor Algorithm
•  Profit = price × min(d, z) – cost × d
Maximize profit – λ × sales, where sales = min(d, z) (negative λ to boost sales;
this λ is a trade-off weight, not the Poisson parameter)
Effectively modifies the sales price of the paper
•  (price – λ) × min(d, z) – cost × d
•  Optimal draw: smallest integer such that
Probability(z <= d) >= (price-λ-cost)/(price-λ)
Negative λ → increase effective sales price → worth sending extra papers
z = demand
d = draw = # delivered
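A sketch of the constrained rule, under the same Poisson demand assumption as before. To avoid confusion with the Poisson parameter, the trade-off weight λ from this slide is called `sales_weight` here; that name, the other function names, and the toy prices are assumptions for illustration.

```python
import math

def poisson_cdf(k, lam):
    """P(z <= k) for z ~ Poisson(lam)."""
    return sum(
        math.exp(j * math.log(lam) - lam - math.lgamma(j + 1))
        for j in range(k + 1)
    )

def constrained_draw(demand_lam, price, cost, sales_weight, max_draw=10_000):
    """Newsvendor draw with an effective price of (price - sales_weight).

    A negative sales_weight raises the effective price, so more papers are sent.
    """
    effective_price = price - sales_weight
    if effective_price <= cost:
        return 0  # no paper has a positive expected marginal gain
    critical_ratio = (effective_price - cost) / effective_price
    for d in range(max_draw + 1):
        if poisson_cdf(d, demand_lam) >= critical_ratio:
            return d
    raise ValueError("critical ratio not reached below max_draw")

# Same store, three settings of the knob.
unconstrained = constrained_draw(4.0, price=3.0, cost=1.0, sales_weight=0.0)
boost_sales = constrained_draw(4.0, price=3.0, cost=1.0, sales_weight=-3.0)
fewer_papers = constrained_draw(4.0, price=3.0, cost=1.0, sales_weight=1.0)
```

With a weight of zero this reduces to the unconstrained newsvendor draw, so experiments with different weights can share one code path.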
The stakeholders choose λ
To our surprise, they chose λ such that
sales loss ~0 and profit was suboptimal.
But still much better than the original
algorithm!!
This tuneable knob is very handy; we run
experiments with different λs and the
stakeholders can make the final decisions
on which results are best.
Reporting: model comparison
Look at both profit and sales differences between treatment & control
Leave trade-off decisions to the stakeholders: better for everyone.
Current architecture: Google Cloud
App Engine: Web front end
App Engine Flex: Back ends for
reporting and predictions
BigQuery, Cloud Storage,
Cloud SQL: for hosting data and
configuration
Deployed via Drone
(github.com/NYTimes/drone-gae)
Github → Docker → GCR → AE Flex
Github → AE Standard
Architecture: Process
Data transfer
•  Weekly cron job per distributor, on AE instance
•  Taskqueue task: copy data from FTP to BQ, using config info in GCS
•  The task fails if the data are not there yet
•  The task queue retries every N minutes until the data show up
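The fail-and-retry pattern can be illustrated generically. In the deployed system the task queue itself performs the retries; the loop below is only a stand-in for that behavior, and every name in it is hypothetical rather than part of any Google Cloud API.

```python
import time

class DataNotReadyError(Exception):
    """Raised when the expected file is not yet on the FTP site."""

def copy_sales_file(available_files, filename):
    """Fail loudly if the file is missing, otherwise pretend to copy it."""
    if filename not in available_files:
        raise DataNotReadyError(filename)
    return f"copied {filename}"

def run_with_retries(task, max_attempts, wait_seconds, sleep=time.sleep):
    """Re-run `task` until it succeeds or `max_attempts` is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except DataNotReadyError:
            if attempt == max_attempts:
                raise
            sleep(wait_seconds)

# Simulated FTP listings: the file appears on the third check.
listings = iter([set(), set(), {"sales.csv"}])
waits = []
result = run_with_retries(
    lambda: copy_sales_file(next(listings), "sales.csv"),
    max_attempts=5,
    wait_seconds=600,
    sleep=waits.append,  # record the waits instead of actually sleeping
)
```

Letting the task fail, rather than polling inside it, keeps each attempt short and leaves the scheduling to the queue.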
Logging
•  Logs sent to Stackdriver, emails sent upon errors
•  Quality checks and progress messages sent to Slack
Architecture: Process
Reporting
•  Reads data from BQ
•  Calculates aggregations & stats about algorithm experiments, using
config info from CloudSQL (BQ & pandas)
•  Saves aggregated data back to BQ
•  Runs statistical tests on data quality (e.g. last week’s total sales within
3σ of the previous mean), aborts on failure
•  Syncs the aggregated BQ tables with CloudSQL, for use in filtering the
front end UI
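The 3σ example of a data-quality test might look like the sketch below. The function names, the use of the sample standard deviation, and the toy numbers are assumptions; the slides do not say how σ is estimated.

```python
import statistics

def within_n_sigma(history, latest, n_sigma=3.0):
    """True if `latest` lies within n_sigma sample std devs of the history mean."""
    mean = statistics.mean(history)
    sigma = statistics.stdev(history)  # needs at least two past values
    return abs(latest - mean) <= n_sigma * sigma

def check_or_abort(history, latest):
    """Raise, stopping the pipeline, when the newest total looks anomalous."""
    if not within_n_sigma(history, latest):
        raise RuntimeError(f"total sales {latest} outside 3 sigma of past weeks")
    return latest

weekly_totals = [100, 102, 98, 101, 99]
ok = within_n_sigma(weekly_totals, 103)
suspicious = within_n_sigma(weekly_totals, 110)
```

Raising an exception here means a bad data delivery stops the run before any aggregates are written or synced to the front end.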
Architecture: Process
Predictions
•  Reads data from BQ
•  Retrains and predicts next week’s sales & how many papers to deliver
to each store each day (sklearn, scipy), using config info from CloudSQL
•  Saves results to GCS
•  Runs tests for unexpected changes in predictions, aborts on failure
Upload
•  The front end copies the results from GCS back to the FTP site
A well-distributed project
•  Experiments: A/B testing algorithms with $ directly as a KPI
•  Communication: fold qualitative business concerns into the math
•  Engineering: Google Cloud Platform improves our process
•  Algorithms: sell-outs and costs directly incorporated
Deploying Data Science for Distribution of The New York Times - Anne Bauer

Towards automating machine learning: benchmarking tools for hyperparameter tu...
 
Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...
 

Recently uploaded

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Recently uploaded (20)

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Deploying Data Science for Distribution of The New York Times - Anne Bauer

  • 1. Deploying Data Science for Distribution of The New York Times. Anne Bauer, anne.bauer@nytimes.com, Lead Data Scientist, NYTimes. PyData 20181017
  • 2. Single copy newspaper distribution 1.  people still buy physical newspapers? 2.  algorithms 3.  experiments to test the algorithms 4.  ...we need to modify the algorithms 5.  app architecture
  • 4. YES! at ~ 47,000 stores! (And you should too.)
  • 5. How many papers should we deliver to each store each day? Too many or too few: a waste of $$, or missed sales! “Single copy” optimization
  • 6. Single copy: the process Weekly process •  Stores report sales from 1-2 weeks earlier (depending on the distributor) •  We pick up the data via FTP and ingest them into our systems •  Our models are retrained and predictions are run •  Predictions are handed off via FTP to the circulation department Turnaround time ~ a few hours
  • 7. Single copy: the existing algorithm Heuristics with many if/then statements •  Highest sale over recent weeks × A + B •  A, B are extremely hand-tuned by store type, location, ... •  Interspersed amid 4600 lines of COBOL
  • 9. Single copy: the existing algorithm Heuristics with many if/then statements •  Highest sale over recent weeks × A + B •  A, B are extremely hand-tuned by store type, location, ... •  Interspersed amid 4600 lines of COBOL •  Difficult to modify to include, e.g. print site cost differences •  Quintessential time series modeling problem. Perfect for data science!
  • 10. Single copy newspaper distribution 1.  people still buy physical newspapers? 2.  algorithms 3.  experiments to test the algorithms 4.  ...we need to modify the algorithms 5.  app architecture
  • 11. Algorithm components The problem is separable into two parts: Prediction: Given previous sales, how many papers will sell next Thursday? Policy: We think N papers will sell, with a known uncertainty distribution. How many should we send (draw)?
  • 12. First pass: AR(1) $X_t = c + \phi X_{t-1} + \varepsilon_t$ Daeil Kim
  • 13. AR(1) Prediction •  $X_t = c + \phi X_{t-1} + \varepsilon_t$ •  Today’s sale is a linear function of last week(s) •  One model per store per day of week •  Use the past year’s data to fit for c, φ •  AR(1) vs. AR(N) and training window chosen via cross-validation Policy •  Draw = ceil(demand) •  Bump: if there have been recent sell-outs, send an extra
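The prediction and policy steps on slide 13 can be sketched as follows. This is a minimal illustration, assuming a plain least-squares fit in place of the statsmodels AR model named on the implementation slide; the function names, the one-paper bump size, and the sample sales figures are assumptions, not the production code.

```python
import math


def fit_ar1(sales):
    """Least-squares fit of X_t = c + phi * X_{t-1} for one store and weekday."""
    prev, nxt = sales[:-1], sales[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_next = sum(nxt) / n
    var_prev = sum((p - mean_prev) ** 2 for p in prev)
    cov = sum((p - mean_prev) * (q - mean_next) for p, q in zip(prev, nxt))
    # Fall back to a constant model if past sales never varied.
    phi = cov / var_prev if var_prev > 0 else 0.0
    c = mean_next - phi * mean_prev
    return c, phi


def predict_next(sales, c, phi):
    """Predicted demand for the next occurrence of this weekday."""
    return c + phi * sales[-1]


def draw_policy(demand, recent_sellout, bump=1):
    """Draw = ceil(demand), plus an assumed bump after recent sell-outs."""
    draw = math.ceil(demand)
    return draw + bump if recent_sellout else draw


# Hypothetical Thursday sales for one store, oldest first.
thursday_sales = [10.0, 7.0, 5.5, 4.75, 4.375]
c, phi = fit_ar1(thursday_sales)
demand = predict_next(thursday_sales, c, phi)
print(c, phi, demand, draw_policy(demand, recent_sellout=True))
```

In the talk's setup one such model would be fitted per store per day of week; the sketch handles a single series.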
  • 14. AR(1) Implementation •  Python 2, with statsmodels AR model. Single script. •  Plots (matplotlib pngs) hosted using Flask to monitor draws & sales •  Run by cron on a local server •  No separate dev/prd environments; code “deployed” via scp
  • 16. Poisson Regression Prediction •  Today’s sale is a linear function of the previous week(s) and the previous year •  One model per store per day of week •  Use the past year’s data to fit model parameters •  Feature time scales chosen via cross-validation •  Assume the sales are drawn from a Poisson distribution rather than Gaussian •  Sell-outs considered in the likelihood function
  • 17. Poisson Regression •  b: # papers bought •  d: # papers delivered (the draw) •  z: demand (Poisson-distributed latent variable) •  λ: Poisson parameter for the demand distribution Each store has a different λ each day. z for that store & day is drawn from a Poisson distribution with that λ. Parameterize λ as a log-linear function of the features X. θ are the parameters of that function, fitted via maximum likelihood.
  • 18. Poisson Regression •  b: # papers bought •  d: # papers delivered (the draw) •  z: demand (Poisson-distributed latent variable) •  λ: Poisson parameter for the demand distribution The probability of the # bought given the demand depends on whether the demand exceeded the papers delivered (i.e. whether there was a sell-out). Use this probability for a maximum likelihood estimation of the parameters θ that describe λ.
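The slide text does not preserve the likelihood itself, so the following is a sketch of one standard way to write it, a censored Poisson likelihood, and should be read as an assumption rather than the talk's exact formula. When a store did not sell out (b < d) the demand was observed directly and contributes the Poisson probability of b; when it sold out (b = d) only a lower bound is known, so the observation contributes P(z ≥ d). The helper names and the sample rows are hypothetical.

```python
import math


def poisson_logpmf(k, lam):
    """log P(z = k) for a Poisson variable with parameter lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def poisson_prob_at_least(k, lam):
    """P(z >= k): the probability that demand reached the draw."""
    below = sum(math.exp(poisson_logpmf(i, lam)) for i in range(k))
    return max(1.0 - below, 1e-300)  # guard against log(0)


def log_likelihood(theta, rows):
    """rows: iterable of (features x, papers bought b, papers delivered d)."""
    total = 0.0
    for x, b, d in rows:
        # Log-linear parameterization: lambda = exp(theta . x)
        lam = math.exp(sum(t * xi for t, xi in zip(theta, x)))
        if b < d:
            total += poisson_logpmf(b, lam)  # demand observed exactly
        else:
            total += math.log(poisson_prob_at_least(d, lam))  # sell-out
    return total


# Hypothetical data: intercept-only features, one normal day, one sell-out.
rows = [([1.0], 2, 5), ([1.0], 4, 4)]
for intercept in (math.log(2.0), math.log(4.0)):
    print(intercept, log_likelihood([intercept], rows))
```

Maximizing this function over θ with any numerical optimizer gives the fitted parameters; treating sell-outs as lower bounds keeps them from biasing the demand estimate downward.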
  • 19. Poisson Regression Policy: Newsvendor Algorithm •  Profit = price × min(d, z) – cost × d •  Taking the derivative of the expected profit and setting it equal to zero implies: Probability(z <= d) = (price-cost)/price •  Optimal draw: smallest integer d such that Probability(z <= d) >= (price-cost)/price •  The probability is given by the CDF of the Poisson distribution for the predicted demand; the best d is found by brute force. (z = demand, d = draw = # delivered)
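The newsvendor rule on slide 19 can be sketched in a few lines. This is an illustration under assumed inputs: `rate` stands for the fitted Poisson parameter of one store and day, and the price and cost values are made up. The `expected_profit` helper is included only to show that the critical-ratio rule and a brute-force search agree.

```python
import math


def poisson_pmf(k, rate):
    """P(z = k) for Poisson demand with the given rate."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))


def optimal_draw(rate, price, cost, max_draw=10_000):
    """Smallest d with P(z <= d) >= (price - cost) / price."""
    critical_ratio = (price - cost) / price
    draw, cdf = 0, poisson_pmf(0, rate)
    while cdf < critical_ratio and draw < max_draw:
        draw += 1
        cdf += poisson_pmf(draw, rate)
    return draw


def expected_profit(draw, rate, price, cost, z_max=200):
    """E[price * min(d, z)] - cost * d, truncating the demand sum at z_max."""
    sales = sum(poisson_pmf(z, rate) * min(draw, z) for z in range(z_max + 1))
    return price * sales - cost * draw


# Hypothetical store: mean demand 10 papers, price 3.00, cost 1.00.
rate, price, cost = 10.0, 3.0, 1.0
by_rule = optimal_draw(rate, price, cost)
by_search = max(range(40), key=lambda d: expected_profit(d, rate, price, cost))
print(by_rule, by_search)
```

The draw comes out above the mean demand here because a missed sale costs more (price minus cost) than an unsold paper (cost).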
  • 20. Poisson Regression Implementation: refactored code! •  Models abstracted to sklearn-like classes to allow for easy future expansion with plug & play model integration •  Common library of functions to: •  get data from the DB •  calculate costs •  check data quality •  ... •  __init__() •  query() •  transform() •  fit() •  predict() •  policy()
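The sklearn-like interface listed on slide 20 might look roughly like the sketch below. Only the method names come from the slide; the signatures, the `run()` convenience method, and the toy `MeanModel` are assumptions made for illustration, and the in-memory history stands in for the database query.

```python
import math
from abc import ABC, abstractmethod


class DrawModel(ABC):
    """Hypothetical base class; method names follow the slide."""

    def __init__(self, store_id, weekday):
        self.store_id = store_id
        self.weekday = weekday

    @abstractmethod
    def query(self):
        """Return raw sales history for this store and weekday."""

    def transform(self, rows):
        """Turn raw rows into model features; identity by default."""
        return rows

    @abstractmethod
    def fit(self, features):
        """Estimate parameters and return self, as sklearn estimators do."""

    @abstractmethod
    def predict(self):
        """Return the predicted demand for the next delivery."""

    def policy(self, demand):
        """Convert predicted demand into a draw; round up by default."""
        return math.ceil(demand)

    def run(self):
        """Assumed helper chaining the steps in the order the slide lists them."""
        features = self.transform(self.query())
        return self.policy(self.fit(features).predict())


class MeanModel(DrawModel):
    """Toy plug-in model: predicts the mean of past sales held in memory."""

    def __init__(self, store_id, weekday, history):
        super().__init__(store_id, weekday)
        self.history = history
        self.mean_ = None

    def query(self):
        return list(self.history)

    def fit(self, features):
        self.mean_ = sum(features) / len(features)
        return self

    def predict(self):
        return self.mean_


print(MeanModel("store-1", "Thu", [4, 6, 5, 6]).run())
```

A new model only has to override the steps it changes, which is the plug-and-play property the slide describes.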
  • 21. Single copy newspaper distribution 1.  people still buy physical newspapers? 2.  algorithms 3.  experiments to test the algorithms 4.  ...we need to modify the algorithms 5.  app architecture
  • 22. Treatment & Control groups: match sales Simple approach •  Take a random sample that approximates the total sales distribution •  For each member of this “treatment” sample, find closest match in mean sales Trial & error checks! •  Exclude cases with any large differences in sales during the training period •  Only consider matches with the same production costs (~print site) •  Make sure treatment & control sell the paper on the same weekdays •  Better no match than a distant match
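The matching rules on slide 22 can be sketched as a greedy nearest-neighbour search. This is a simplified illustration: the data layout, the `max_gap` threshold, and matching without replacement are assumptions, and the check for large week-by-week sales differences during the training period is left out for brevity.

```python
def match_controls(treatment, pool, max_gap):
    """Greedy nearest match on mean sales.

    treatment, pool: dicts of store_id -> (mean_sales, print_site, weekdays).
    Returns {treatment_id: control_id}; treatment stores without a close
    enough match are left out (better no match than a distant match).
    """
    available = dict(pool)
    matches = {}
    for t_id, (t_sales, t_site, t_days) in treatment.items():
        # Only controls with the same print site and the same selling weekdays.
        eligible = [
            (abs(c_sales - t_sales), c_id)
            for c_id, (c_sales, c_site, c_days) in available.items()
            if c_site == t_site and c_days == t_days
        ]
        if not eligible:
            continue
        gap, c_id = min(eligible)
        if gap <= max_gap:
            matches[t_id] = c_id
            del available[c_id]  # each control store is used at most once
    return matches


# Hypothetical stores.
treatment = {
    "T1": (20.0, "site_a", frozenset({"Mon", "Thu"})),
    "T2": (50.0, "site_b", frozenset({"Sun"})),
}
pool = {
    "C1": (21.0, "site_a", frozenset({"Mon", "Thu"})),
    "C2": (20.0, "site_b", frozenset({"Mon", "Thu"})),
    "C3": (80.0, "site_b", frozenset({"Sun"})),
}
print(match_controls(treatment, pool, max_gap=5.0))
```

Here T1 pairs with C1 even though C2 has closer sales, because C2 prints at a different site, and T2 stays unmatched because its only eligible control is too far away.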
  • 23. Reporting D3 Dashboard Optimize for profit: ✔ Make stakeholders happy: ✗ Our profit comes at the expense of sales! Sales matter beyond sales profit. Circulation numbers matter. Hard to quantify that value!
  • 24. Goal: Optimize for profit ... but don’t decrease sales “too much” ∴ Constrained optimization
  • 25. Single copy newspaper distribution 1.  people still buy physical newspapers? 2.  algorithms 3.  experiments to test the algorithms 4.  ...we need to modify the algorithms 5.  app architecture
  • 26. Constrained newsvendor algorithm Policy: Newsvendor Algorithm •  Profit = price × min(d, z) – cost × d •  Maximize profit – λ × sales (negative λ to boost sales), where λ is here the trade-off weight, not the Poisson parameter •  This effectively modifies the sales price of the paper: (price – λ) × min(d, z) – cost × d •  Optimal draw: smallest integer d such that Probability(z <= d) >= (price-λ-cost)/(price-λ) Negative λ → increase effective sales price → worth sending extra papers (z = demand, d = draw = # delivered)
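The effect of the trade-off weight on the draw can be sketched as below. This is an illustration with made-up numbers; `sales_weight` plays the role of the slide's λ (renamed to avoid confusion with the Poisson parameter, here `rate`), and the function name is an assumption.

```python
import math


def poisson_pmf(k, rate):
    """P(z = k) for Poisson demand with the given rate."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))


def constrained_draw(rate, price, cost, sales_weight, max_draw=10_000):
    """Newsvendor draw with the effective price (price - sales_weight)."""
    effective_price = price - sales_weight
    if effective_price <= 0:
        raise ValueError("effective price must be positive")
    critical_ratio = (effective_price - cost) / effective_price
    draw, cdf = 0, poisson_pmf(0, rate)
    while cdf < critical_ratio and draw < max_draw:
        draw += 1
        cdf += poisson_pmf(draw, rate)
    return draw


# Hypothetical store: mean demand 10, price 3.00, cost 1.00.
for weight in (1.0, 0.0, -3.0):
    print(weight, constrained_draw(10.0, 3.0, 1.0, weight))
```

A weight of zero recovers the unconstrained newsvendor draw; making it more negative raises the effective price and therefore the number of papers sent.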
  • 27. The stakeholders choose λ To our surprise, they chose λ such that sales loss ~0 and profit was suboptimal. But still much better than the original algorithm!! This tuneable knob is very handy; we run experiments with different λs and the stakeholders can make the final decisions on which results are best.
  • 28. Reporting: model comparison Look at both profit and sales differences between treatment & control Leave trade-off decisions to the stakeholders: better for everyone.
  • 29. Single copy newspaper distribution 1.  people still buy physical newspapers? 2.  algorithms 3.  experiments to test the algorithms 4.  ...we need to modify the algorithms 5.  app architecture
  • 30. Current architecture: Google Cloud •  App Engine: Web front end •  App Engine Flex: Back ends for reporting and predictions •  BigQuery, Cloud Storage, Cloud SQL: for hosting data and configuration •  Deployed via Drone (github.com/NYTimes/drone-gae): Github → Docker → GCR → AE Flex; Github → AE Standard
  • 31. Architecture: Process Data transfer •  Weekly cron job per distributor, on AE instance •  Taskqueue task: copy data from FTP to BQ, using config info in GCS •  Task fails if the data are not there •  The task queue retries every N minutes until the data shows up Logging •  Logs sent to Stackdriver, emails sent upon errors •  Quality checks and progress messages sent to Slack
  • 32. Architecture: Process Reporting •  Reads data from BQ •  Calculates aggregations & stats about algorithm experiments, using config info from CloudSQL (BQ & pandas) •  Saves aggregated data back to BQ •  Runs statistical tests on data quality (e.g. last week’s total sales within 3σ of previous mean), aborts if failure •  Syncs the aggregated BQ tables with CloudSQL, for use in filtering the front end UI
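The data-quality gate mentioned on slide 32 (last week's total sales within 3σ of the previous mean) can be sketched as follows. The function and exception names, the use of the sample standard deviation, and the sample totals are assumptions for illustration.

```python
import statistics


class DataQualityError(Exception):
    """Raised to abort the run when a quality check fails."""


def check_total_sales(previous_totals, latest_total, n_sigma=3.0):
    """Abort unless the latest weekly total is within n_sigma of the past mean."""
    mean = statistics.mean(previous_totals)
    sigma = statistics.stdev(previous_totals)  # sample standard deviation
    if abs(latest_total - mean) > n_sigma * sigma:
        raise DataQualityError(
            f"total sales {latest_total} is more than {n_sigma} sigma "
            f"from the previous mean {mean:.1f}"
        )
    return True


# Hypothetical weekly totals.
history = [100, 102, 98, 101, 99]
print(check_total_sales(history, 103))
```

Raising an exception rather than returning False makes the abort explicit, so a failed check cannot be silently ignored by the calling job.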
  • 33. Architecture: Process Predictions •  Reads data from BQ •  Retrains and predicts next week’s sales & how many papers to deliver to each store each day (sklearn, scipy), using config info from CloudSQL •  Saves results to GCS •  Runs tests for unexpected changes in predictions, aborts if failure Upload •  The front end copies the results from GCS back to the FTP site
  • 34. A well-distributed project •  experiments: A/B testing algorithms with $ directly as a KPI •  communication: Fold qualitative business concerns into the math •  engineering: Google Cloud Platform improves our process •  algorithms: Sell-outs, costs directly incorporated