This document summarizes the key steps and outcomes of a project to build an end-to-end recommendation system for a power utility company. The system was designed to integrate machine learning models with mobile and call center systems to recommend ancillary products to customers. The project involved exploring customer data, developing machine learning models through an iterative process, and operationalizing the models by building APIs and automated workflows. The new system provided recommendations via microservices and represented an improvement over the utility's previous manual, less rigorous approach to data science and modeling.
4. Context
End-to-End Recommendation System, from Data to Insights

Customer
● Power utility company seeking to build an end-to-end recommendation system for ancillary products, integrating with its mobile app and call center systems

Solution
● Machine learning techniques and rich data to build models that recommend products
● Microservices-based architecture to integrate data science results into the mobile app and call center systems
● Agile development practices to build high-quality software

Outcome
✓ End-to-end product recommendation solution
✓ Model results exposed via API
✓ Enablement of the Data Science team
5. Technology and Data Overview

Data Sources
● Electric charges
● Account
● Demographic data (Acxiom)
● Product eligibility
● Product participation

Scale
● 6.5+ million customers
● 150+ million rows

Tools

Platform
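These sources ultimately feed one modeling table per eligible customer-product pair. A toy sketch of that join in plain Python (all column names and values are invented for illustration; the real joins ran in PySpark at the scale above):

```python
accounts = [
    {"customer_id": 1, "tenure_months": 24},
    {"customer_id": 2, "tenure_months": 6},
]
participation = [
    {"customer_id": 1, "product": "surge_protection", "enrolled": True},
]
eligibility = [
    {"customer_id": 1, "product": "surge_protection"},
    {"customer_id": 2, "product": "surge_protection"},
]

# Build one modeling row per eligible (customer, product) pair,
# labeled by whether the customer actually participates.
by_customer = {a["customer_id"]: a for a in accounts}
enrolled = {(p["customer_id"], p["product"]) for p in participation if p["enrolled"]}

rows = [
    {**by_customer[e["customer_id"]],
     "product": e["product"],
     "label": (e["customer_id"], e["product"]) in enrolled}
    for e in eligibility
]
```

At PySpark scale the same shape is a pair of joins (eligibility to account on `customer_id`, then a left join to participation) rather than in-memory dictionaries.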
6. Agile Data Science
● Pair Programming
● Test Driven Development
● Continuous Integration / Continuous Delivery
● API First
● Tracker
● Standups
● Retros
7. Agile Data Science

Discovery Phase
✓ Data exploration to understand the context of the data and its business implications
✓ Data cleansing, transformation and feature engineering
✓ Training, validation and evaluation of ML algorithms
✓ Multiple iterations of the above steps to reach the desired model performance

Operationalization (O16n) Phase
✓ Test-driven development of data cleansing and feature engineering scripts
✓ Automated data pipelines to cleanse, transform and score new data
✓ Monitoring code that checks incoming data to flag when remodeling is needed
✓ APIs to consume model output
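The monitoring step can be sketched as a simple statistical check on each incoming batch (plain Python; the 3-sigma threshold and the feature names in the usage example are illustrative assumptions, not the project's actual rule):

```python
import statistics

def drift_report(baseline, new_batch, threshold=3.0):
    """Flag features whose new-batch mean drifts more than `threshold`
    baseline standard deviations away from the baseline mean -- a signal
    that the model may need retraining on fresher data."""
    flagged = {}
    for feature, base_values in baseline.items():
        mu = statistics.mean(base_values)
        sigma = statistics.pstdev(base_values) or 1e-9  # avoid divide-by-zero
        new_mu = statistics.mean(new_batch[feature])
        z = abs(new_mu - mu) / sigma
        if z > threshold:
            flagged[feature] = round(z, 2)
    return flagged

# A feature whose distribution shifts gets flagged; a stable one does not.
report = drift_report(
    baseline={"kwh": [10, 11, 9, 10], "calls": [1, 2, 3]},
    new_batch={"kwh": [30, 31, 29], "calls": [2, 2, 2]},
)
```

In production this kind of check would run inside the scoring pipeline and write to the logging/validation layer rather than return a dict.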
8. End to End
1. Data exploration, feature generation and ad-hoc ML modeling
2. Test-driven development (TDD) to create production-quality PySpark scripts
3. An automated scoring workflow using the PySpark scripts to generate recommendations
4. A recommendation microservice on Pivotal Cloud Foundry to serve customer recommendations
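The final stage amounts to serving precomputed scores per customer. A framework-free sketch of that handler logic (customer ids, product names and response fields are invented for illustration; the real service ran as a microservice on Pivotal Cloud Foundry):

```python
import json

# Precomputed propensity scores from the scoring workflow (illustrative values).
SCORES = {
    "C001": {"surge_protection": 0.82, "appliance_plan": 0.41, "led_kit": 0.12},
}

def recommend(customer_id, top_n=2):
    """Return (JSON body, status code) with the top-N products for a customer."""
    scores = SCORES.get(customer_id)
    if scores is None:
        return json.dumps({"error": "unknown customer"}), 404
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    body = {
        "customer_id": customer_id,
        "recommendations": [{"product": p, "score": s} for p, s in ranked],
    }
    return json.dumps(body), 200
```

The mobile app and call center systems would call an HTTP route wrapping this lookup; the scoring workflow refreshes the score store on each run.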
9. Discovery Phase: Data Exploration
Worked with subject matter experts (SMEs) to understand how the data is generated, how it is used, and its business implications.

Takeaways
● The context and business impact of the data gained here is very valuable to the eventual success of the machine learning model
● There may be resistance from stakeholders to this activity (“not real work”)
● Mitigate this resistance by sharing the data exploration insights and their business implications
10. Discovery Phase: Feature Engineering
● Our goal was to predict the propensity of a customer to buy a particular ancillary product
● We only had information on when a customer bought the product
● We did not have any solicitation history
● We took all the buy events and, for each one, calculated features over a backward-looking window; these were our positive examples
● We sampled negative events randomly and calculated features using the same backward-looking window

[Figure: timeline showing a buy event and the backward-looking window used for features]

Takeaways
● Setting up data to run machine learning algorithms is more of an art than a science
● Balance positive and negative examples, especially for rare events
● Be aware of biases that may affect the data; these biases have modeling implications
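The labeling scheme above can be sketched as follows (plain Python; the 90-day window, the usage-based features and the fixed seed are illustrative assumptions, not the project's actual feature set):

```python
import random
from datetime import date, timedelta

def window_features(usage, end, days=90):
    """Aggregate usage readings over a backward-looking window ending at `end`."""
    window = [kwh for d, kwh in usage if end - timedelta(days=days) <= d < end]
    return {"window_kwh": sum(window), "window_reads": len(window)}

def build_examples(usage, buy_dates, n_negatives, seed=42):
    """Positives: features computed at each buy event. Negatives: features at
    randomly sampled non-buy dates, using the same backward-looking window."""
    rng = random.Random(seed)
    examples = [{**window_features(usage, d), "label": 1} for d in buy_dates]
    buys = set(buy_dates)
    candidates = [d for d, _ in usage if d not in buys]
    for d in rng.sample(candidates, n_negatives):
        examples.append({**window_features(usage, d), "label": 0})
    return examples

# Monthly readings for one customer, with a single buy event.
usage = [(date(2020, 1, 1) + timedelta(days=30 * i), 100) for i in range(6)]
examples = build_examples(usage, buy_dates=[date(2020, 5, 1)], n_negatives=2)
```

Controlling `n_negatives` is how the positive/negative balance mentioned in the takeaways is tuned for rare buy events.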
11. Discovery Phase: ML Modeling Iterations
● The figure alongside shows the ML model iteration process
● We tried many algorithms with various hyperparameters
● Elastic net models were the most viable and were chosen for deployment during the operationalization phase

Takeaways
● Getting feedback from SMEs on the model results is very important
● Sharing the most impactful features is a great way to get feedback and build SME trust in ML models
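Elastic net combines an L1 and an L2 penalty on the model weights. As a minimal illustration of the model class, here is a from-scratch logistic regression with an elastic net penalty, trained by gradient descent on toy data (the project would have used a library implementation, and the `alpha`/`l1_ratio` values here are arbitrary):

```python
import math

def train_elastic_net_logreg(X, y, alpha=0.1, l1_ratio=0.5, lr=0.1, epochs=500):
    """Minimize: log-loss + alpha * (l1_ratio * |w|_1 + (1-l1_ratio)/2 * |w|_2^2)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # prediction minus label
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        for j in range(d):
            # Subgradient of the L1 term plus gradient of the L2 term.
            sign = 1.0 if w[j] > 0 else -1.0 if w[j] < 0 else 0.0
            reg = alpha * (l1_ratio * sign + (1 - l1_ratio) * w[j])
            w[j] -= lr * (gw[j] / n + reg)
        b -= lr * gb / n
    return w, b

def predict_proba(w, b, x):
    """Propensity score for one example."""
    return 1.0 / (1.0 + math.exp(-(b + sum(wj * xj for wj, xj in zip(w, x)))))
```

The L1 term drives uninformative feature weights to zero, which is one reason elastic net models are attractive when sharing impactful features with SMEs.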
12. O16n: Production Scripts Using TDD
After the discovery phase, we used TDD to write production scripts for data cleansing, feature generation and model scoring.
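In that spirit, a small sketch of the test-first style for one hypothetical cleansing rule (plain asserts for brevity; the project's actual suites targeted its PySpark scripts):

```python
def cleanse_usage(raw):
    """Drop records with missing or negative kWh readings and
    normalize customer ids to upper case."""
    cleaned = []
    for record in raw:
        kwh = record.get("kwh")
        if kwh is None or kwh < 0:
            continue
        cleaned.append({"customer_id": record["customer_id"].upper(), "kwh": kwh})
    return cleaned

# In TDD, a test like this is written first and fails until the
# implementation above makes it pass.
def test_drops_bad_readings_and_normalizes_ids():
    raw = [
        {"customer_id": "c1", "kwh": 10.0},
        {"customer_id": "c2", "kwh": -5.0},   # negative reading: dropped
        {"customer_id": "c3", "kwh": None},   # missing reading: dropped
    ]
    assert cleanse_usage(raw) == [{"customer_id": "C1", "kwh": 10.0}]

test_drops_bad_readings_and_normalizes_ids()
```

The same pattern carries to PySpark by asserting on small local DataFrames in the test suite.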
13. Why Pairing and TDD?

Test Driven Development
“Time spent writing a test beforehand is rarely wasted. Code written to pass a test takes much less time to debug.” – client 1
“TDD gives me the confidence that I won’t commit code that breaks existing functionality, no matter what I change.” – client 2

Pair Programming
“Pairing instills critical thinking, builds confidence, distributes knowledge, and gets work done. Most methods of work only do one of those things.” – client 1
“Pairing was an educational experience for me, as well as a real-time validator. If my pair catches a problem with my code, I’ll know about it in real time.” – client 2
14. End to End
1. Data exploration, feature generation and ad-hoc ML modeling
2. Test-driven development (TDD) to create production-quality PySpark scripts
3. An automated scoring workflow using the PySpark scripts to generate recommendations
4. A recommendation microservice on Pivotal Cloud Foundry to serve customer recommendations
15. Summary of Enablement

Before
● Ad-hoc model building in SAS Enterprise Miner
● Minimal data science rigor
● Manual data upload to the SAS environment for modeling
● Model results shared using Excel
● Results used only for forecasting

After
✓ Data science on modern open source tools
✓ Data science rigor
✓ Automated workflow for data cleansing, feature generation and scoring
✓ Robust logging and validation of data and model results
✓ Recommendation microservice up in production to be consumed by app developers