Automating Prediction Model Life-cycles with Spark

A Spark Stack for Automating the Life-cycle of Prediction Models
© 2017 24/7 Customer, Inc. All rights reserved.
Monday, November 13, 2017
Samik Raychaudhuri, Ph.D.
Director, Data Science Group
[24]7.ai Innovation Labs
Bangalore Apache Spark Meetup
Nov 2017

Agenda
• Introduction
• Use cases for ML models at [24]7
• Model management at [24]7
• The Spark Stack
• Conclusion

Prediction Models in [24]7.ai

About [24]7
• [24]7 is a software company based in the Bay Area,
US, and Bangalore, India, delivering customer
support solutions enhanced by predictive
technologies
• Using predictive models to drive enhanced customer
experience is an emerging, niche application
of analytics and big data
• Our machine learning models on big data predict
customer intent across various touchpoints in real
time, helping us provide an intuitive experience
when the customers (of our clients) contact us

[24]7 - World’s Largest Self-Service Network
We deliver a cloud-based software platform that uses
predictive analytics and big data to make company-to-consumer
connections intuitive.
• 2.5B digital interactions/year
• 4.5TB interaction data/week
• 90%+ CSAT across channels
• 100M visitors/year
• 1st true multi-modal solution
• 1st omni-channel solution

Assist (for Chat)
Smart chat platform for online and
mobile engagement
Assist (for IVR)
Call deflection to mobile web chat for
higher NPS and ROI
Assist (for Voice)
Smart voice agent platform for multi-
modal engagement of voice callers
SELF
SERVICE
PRODUCTS
ASSISTED
SERVICE
PRODUCTS
Predictive Sales
Drive higher incremental revenue and customer
acquisition
Predictive Service
Reduce customer effort to increase CSAT and NPS in
customer service
Chat Agents
Chat agent services that engage customers and help
reduce costs, generate revenue, and improve CSAT
Voice Agents
Voice agent services that engage customers and help
reduce costs, generate revenue, and improve CSAT
SOLUTIONS
SERVICES
Social
Social sharing
Mobile
Mobile self-service
Vivid Speech
Mobile for IVR
Speech
Speech self-service IVR
[24]7 iLabs: A Quick Snapshot

Data Science – What it means for [24]7
fn (Customer type,
location, Identity, interaction
context, journey, behavior …)
Intent: Purchase;
issue with product or
service, …
Customer Intent Engine
Intent Models
fn (Identity, intent type,
history, channel affinity,
customer value…)
Measure: usage,
containment, repeat…
Engagement Engine
Guided self-service
Chat
Phone
Sales
Resolution
Experience
Retention
Metrics: conversion
rate, revenue, CSAT,
…
Outcomes
Machine Learning
At Scale
Creating Personalized Intuitive Consumer Experiences

Big Data in [24]7
Data Sources Technologies

Use case of intent prediction: Web visits
• For our clients in the retail vertical, we provide chat
agents who are experienced in providing differentiated
support
• The differentiation is based on:
• Current phase of the journey
• Specific persona of the visitor
• We use ML models to compute probabilities of various
intents, and use them to provide customized intervention
for sales and service journeys

Use case of intent prediction: IVR Calls
• For our clients in banking, our IVR platform provides self-service
options for service journeys
• The challenge is to resolve the issues faced by the
customer within the IVR platform itself
• One of our flagship offerings is our natural language
understanding engine, which works on free-flowing responses
• Again, we use ML models to compute probabilities of
various intents from the response, and use them to
provide a specific service or transfer the caller to a voice agent
along with the context

Use case of intent prediction: within Chat
• An emerging use case is deploying AI-assisted Virtual
Agents (chatbots) for various enterprise use cases
• The challenges here are:
• To detect intent from natural language texts, and then provide
natural language response – essentially continue a natural
conversation
• To be able to bring in human agents when the conversation goes
out-of-scope for the VA.
• We are using ML models to detect intent and state from
the conversation and take appropriate action

Technology and Model Management
at [24]7

High Level Architecture
Events Real Time
Platform
Batch Data
Platform
Events
Reporting
and BI
Predictions
Models

[24]7 Big Data Platform: Technologies
• We use multiple open-source technologies to power our platform.
Some of the technologies in use:
• Real Time Platform
• Apache Cassandra ring [http://cassandra.apache.org/]
• Jetty server for execution [http://www.eclipse.org/jetty/]
• Batch Data Platform
• Apache Hadoop [http://hadoop.apache.org/]
• Apache Hive [http://hive.apache.org/]
• Apache Spark [http://spark.apache.org/] [Upcoming]
• Others
• Apache Kafka [http://kafka.apache.org/]
• Apache Avro [http://avro.apache.org/]
• HP Vertica database [http://www.vertica.com/]
• Apache Pig [http://pig.apache.org/]
• Apache Druid [https://druid.apache.org/]

Architecture for model building
Events
Batch Data Platform
HDFS
Nightly MR Jobs
Structured Datamart
Regular Model
Building
Model Management Platform
Analytics & Monitoring
Retraining
R&D Model
Building
Deploy Trained Model

Model building workflow
Sign Contract
Data Requirement Gathering
Data Capture
Exploratory Data Analysis
Model Building
Simulation
Model Deployment
Monitoring and Retraining

Platform for Model Management – Why?
• Prediction models are one of the key pieces in achieving
the targets set in the contract, but they are part of a larger
workflow that needs standardization
• Standard transformations: We now support a set of standard
transformations, coded in the same standard way in any model
• Standard libraries: Different libraries in different software
ecosystems (e.g., R, Python, Spark ML) produce slightly
different results. With this platform, we can compare models, or
select one runtime to deploy models
• Skill can become an issue when working on prediction models
for various clients; the platform reduces the dependence on individual
skill by providing templates that encode best practices
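
One way to picture the "compare models" point is a check of how far two runtimes' scores diverge on the same rows. The sketch below is only an illustration: the function name, the tolerance, and the summary fields are hypothetical and are not taken from the platform described in the slides.

```python
def compare_predictions(preds_a, preds_b, tolerance=1e-3):
    """Summarise how far two runtimes' scores diverge on the same rows.

    preds_a / preds_b: lists of scores for identical rows, e.g. one list
    from an R model and one from a Spark ML model (hypothetical inputs).
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must cover the same rows")
    diffs = [abs(a - b) for a, b in zip(preds_a, preds_b)]
    return {
        "max_abs_diff": max(diffs, default=0.0),
        "rows_over_tolerance": sum(d > tolerance for d in diffs),
    }
```

A report like this makes "slightly different" measurable before one runtime is chosen for deployment.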

Spark Stack for Model Management

Early Iteration for Model Management Platform
• Model management platform was originally built on top
of Vertica
• Vertica from HP (now MicroFocus) is a columnar database with
strong analytical query capabilities
• We loaded the output of MR jobs in Vertica, which acted as our
datamart
• Model training workflow was managed by Oozie
• The actual training of models was performed in the Vertica
cluster using Vertica UDFs written in C++ and R

Early Iteration for Model Management Platform
Events
Batch Data Platform
HDFS
Nightly MR Jobs
Structured Datamart: Vertica
Regular Model
Building
Model Management Platform:
Vertica UDFs + Oozie workflows
Retraining
R&D Model
Building

Pros and Cons of using Vertica
• Pros
• All the EDA and computations happened in-database, so there was no
substantial data movement for model building
• Vertica supports SQL and R, thus resulting in easy onboarding for analysts
and data scientists
• Custom code for feature engineering from existing columns
• Cons
• Speed of computation was limited by the cluster size of Vertica
• R UDFs cannot be parallelized, thereby limiting the amount of distributed
computations that can be done while training complex models
• In some cases, hard to maintain or find R libraries compatible with Vertica
• Compatibility issues in general
• Small community of developers
• Cumbersome model deployment
• License requirement, versus an existing Spark cluster

Moving to Spark
• Spark is a strong distributed computation engine with a huge
community supporting it
• It is general purpose, so it can also run scripts/code for data
preparation as well as monitoring
• Spark ML has matured, with lots of features, quick bug fixes
and (again) an active community
• We wanted to expand model building to more use cases, and
the required data were already available in HDFS
• Spark models can be directly deployed on our production JVM
stack
• We already had a Spark cluster which was being used for
ad-hoc queries
• Eliminates the need for specific feature engineering by using
the hashing trick

Model Management Platform with Spark
Events
Batch Data Platform
HDFS
Structured Datamart:
HDFS/Vertica
Regular Model
Building
Model Management Platform:
Spark Cluster
Retraining
R&D Model
Building
Nightly Jobs (MR+Spark)

Developing the Framework
• The framework is a wrapper around Spark libraries,
developed in-house in Scala
• Has specialized modules to manage:
• Provision for config reading and validation
• Provision for reading data from HDFS (through Hive) and Vertica
• Provision for output (models) to be available as both bytecode
and as other (legacy) formats
• Provision for supporting custom model training workflows
including post-processing
• API for accessing individual functionality
• Needed around 8-9 man-months to complete the project
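
The in-house framework is written in Scala and its code is not shown in the slides. As a language-neutral illustration of the "config reading and validation" module only, here is a minimal Python sketch. Every key name, allowed value, and the JSON format itself is a hypothetical assumption.

```python
import json

# Hypothetical schema: key name -> expected Python type.
REQUIRED_KEYS = {
    "model_name": str,
    "source": str,        # where training data is read from
    "algorithm": str,
    "num_features": int,  # e.g. size of a hashed feature space
}
ALLOWED_SOURCES = {"hive", "vertica"}  # the two sources named in the slides

def load_config(text):
    """Parse a JSON training config and report all problems at once."""
    cfg = json.loads(text)
    errors = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in cfg:
            errors.append(f"missing key: {key}")
        elif not isinstance(cfg[key], expected):
            errors.append(f"{key} must be of type {expected.__name__}")
    if isinstance(cfg.get("source"), str) and cfg["source"] not in ALLOWED_SOURCES:
        errors.append(f"source must be one of {sorted(ALLOWED_SOURCES)}")
    if isinstance(cfg.get("num_features"), int) and cfg["num_features"] <= 0:
        errors.append("num_features must be positive")
    if errors:
        raise ValueError("; ".join(errors))
    return cfg
```

Collecting every error before raising, rather than stopping at the first one, lets a bad config be fixed in a single pass before a training job is submitted.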

HashingTF in SparkML
• HashingTF is a way of automating feature engineering
from textual data using the hashing trick
• Essentially, using this method, one can project text to a large
multidimensional space, thereby capturing nuanced features
• Steps shown in the diagram:
• UTF-8 Encoding
• hashBytes: Byte to Int conversion, then Multiply/Rotate/Add/Shift/XOR mixing with constants
• Hashed Value
• Index Scaling by Number of Features
• TF Computation
• HashingTF vector
• Array of Features
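
The steps above can be sketched in plain Python: encode each term as UTF-8, hash the bytes with a MurmurHash3-style mix, scale the hash to an index by the number of features, and count term frequencies. This is a sketch under assumptions, not Spark's code. The function names are hypothetical, the hash is the standard 32-bit MurmurHash3, and the resulting indices are not guaranteed to match those produced by any particular Spark version. The defaults (seed 42, 2^18 features) mirror Spark ML's documented defaults.

```python
from collections import Counter

MASK = 0xFFFFFFFF  # keep all arithmetic in 32 bits

def _rotl(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK

def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Standard MurmurHash3 (x86, 32-bit) of a byte string."""
    c1, c2 = 0xCC9E2D51, 0x1B873593  # mixing constants
    h = seed & MASK
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        # Byte-to-int conversion: 4 bytes, little-endian.
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (_rotl((k * c1) & MASK, 15) * c2) & MASK
        h = (_rotl(h ^ k, 13) * 5 + 0xE6546B64) & MASK
    tail = data[4 * n_blocks:]
    if tail:  # 1 to 3 leftover bytes
        k = int.from_bytes(tail, "little")
        h ^= (_rotl((k * c1) & MASK, 15) * c2) & MASK
    # Finalisation: shift/XOR/multiply avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK
    h ^= h >> 16
    return h

def hashing_tf(terms, num_features=1 << 18, seed=42):
    """Map a list of terms to a sparse {index: term frequency} vector."""
    counts = Counter()
    for term in terms:
        index = murmur3_32(term.encode("utf-8"), seed) % num_features
        counts[index] += 1.0
    return dict(counts)
```

Because the index depends only on the term's bytes, no vocabulary has to be built or stored. The price is that two different terms can collide on the same index, which a large `num_features` makes rare.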

Using HashingTF
• Using HashingTF can replace multiple preprocessing
steps for ML model training:
• Dealing with categorical variables
• Custom feature extraction (e.g., using regular expressions) from
text data
• Example: Categorizing URLs
• In our comparison experiments, we have noticed similar
or better results from models using HashingTF vs models
developed the traditional way
• The effect was more prominent when the original model included
multiple custom-created features from a large amount of text
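
To make the URL example concrete, the sketch below replaces hand-written per-category regular expressions with one generic tokenizer followed by hashing. It is an illustration under assumptions: the tokenization rule, the function names, and the example URL are hypothetical, and `zlib.crc32` is swapped in as a simple stable hash in place of the hash used by Spark's HashingTF.

```python
import re
import zlib

def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]

def hashed_url_features(url, num_features=1024):
    """Turn a URL into a fixed-length term-frequency vector."""
    vector = [0.0] * num_features
    for tok in url_tokens(url):
        # crc32 is a stand-in hash; any stable hash of the bytes would do.
        index = zlib.crc32(tok.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector
```

For example, `url_tokens("https://shop.example.com/Cart/checkout?step=2")` yields `["https", "shop", "example", "com", "cart", "checkout", "step", "2"]`. A downstream model can then learn which tokens indicate a cart or checkout page without anyone maintaining a rule per page type.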

Other Benefits of using Spark
• Model training is much faster compared to the
legacy method
• We are able to use distributed computation among the
nodes
• For a model trained on 1M rows, we see a 2x-5x
improvement
• Innovative deployment of production models
• Uses a mix of JavaScript code and Java byte-serialized
code for a DAG of models
• Complex models in Spark format (byte-serialized) run
faster
• Faster cycle of model training, testing and
deployment as the same underlying infrastructure is
used
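
The slides do not say how the DAG of models is executed in production. As a rough sketch of the general idea only, the code below evaluates a set of model nodes in dependency order, passing each node the outputs of its upstream nodes. The node names, the callable signature, and the use of Python's `graphlib` are all assumptions for illustration; the real system mixes JavaScript and byte-serialized Java code.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_model_dag(nodes, deps, features):
    """Evaluate model nodes in dependency order.

    nodes: name -> callable(features, upstream_outputs)
    deps:  name -> set of upstream node names (may omit root nodes)
    """
    graph = {name: set(deps.get(name, ())) for name in nodes}
    outputs = {}
    for name in TopologicalSorter(graph).static_order():
        upstream = {dep: outputs[dep] for dep in graph[name]}
        outputs[name] = nodes[name](features, upstream)
    return outputs

# Hypothetical three-node chain: segment -> intent -> action.
example_nodes = {
    "segment": lambda f, up: "returning" if f["visits"] > 1 else "new",
    "intent": lambda f, up: 0.8 if up["segment"] == "returning" else 0.3,
    "action": lambda f, up: "offer_chat" if up["intent"] > 0.5 else "none",
}
example_deps = {"intent": {"segment"}, "action": {"intent"}}
```

Calling `run_model_dag(example_nodes, example_deps, {"visits": 3})` evaluates `segment`, then `intent`, then `action`, so each model only runs once its inputs exist.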

Future work on the platform
• We are exploring training of other complex
models on the Spark platform
• Deep learning models for chatbot conversations
using MXNet
• We have worked on some innovations in
sampling, solving optimization problems and
training SVM models in the Spark library
• Would like to share those with the Spark community

Questions
