A Spark-stack for Automating Life-cycle of Prediction Models
© 2017 24/7 Customer, Inc. All rights reserved.
Monday, November 13, 2017
Samik Raychaudhuri, Ph.D.
Director, Data Science Group
[24]7.ai Innovation Labs
Bangalore Apache Spark Meetup
Nov 2017
Agenda
• Introduction
• Use cases for ML models at [24]7
• Model management at [24]7
• The Spark Stack
• Conclusion
Prediction Models in [24]7.ai
About [24]7
• [24]7 is a software company based in the Bay Area,
US, and Bangalore, India, delivering customer
support solutions enhanced by predictive
technologies
• Using predictive models to drive enhanced customer
experience is an emerging, niche application area
of analytics and big data
• Our machine learning models on big data predict
customer intent across various touchpoints in real
time, helping us provide an intuitive experience
when the customers (of our clients) contact us
• 2.5B Digital Interactions/Year
• 4.5TB Interaction Data/Week
• 90%+ CSAT across channels
• 100M Visitors/Year
• 1st True Multi-modal Solution
• 1st Omni-channel Solution
We deliver a cloud-based software platform that uses
predictive analytics and big data to make company-to-consumer
connections intuitive.
[24]7 - World’s Largest Self-Service Network
Assist (for Chat)
Smart chat platform for online and
mobile engagement
Assist (for IVR)
Call deflection to mobile web chat for
higher NPS and ROI
Assist (for Voice)
Smart voice agent platform for multi-modal
engagement of voice callers
SELF SERVICE PRODUCTS
ASSISTED SERVICE PRODUCTS
Predictive Sales
Drive higher incremental revenue and customer
acquisition
Predictive Service
Reduce customer effort to increase CSAT and NPS in
customer service
Chat Agents
Chat agent services that engage customers and help
reduce costs, generate revenue, and improve CSAT
Voice Agents
Voice agent services that engage customers and help
reduce costs, generate revenue, and improve CSAT
SOLUTIONS
SERVICES
Social
Social sharing
Mobile
Mobile self-service
Vivid Speech
Mobile for IVR
Speech
Speech self-service IVR
[24]7 iLabs: A Quick Snapshot
Data Science – What it means for [24]7
fn (Customer type, location, identity, interaction context, journey, behavior …)
Intent: purchase; issue with product or service, …
Customer Intent Engine
Intent Models
fn (Identity, intent type, history, channel affinity, customer value …)
Measure: usage, containment, repeat …
Engagement Engine
Guided self-service
Chat
Phone
Sales
Resolution
Experience
Retention
Metrics: conversion rate, revenue, CSAT, …
Outcomes
Machine Learning
At Scale
Creating Personalized Intuitive Consumer Experiences
Big Data in [24]7
Data Sources
Technologies
Use case of intent prediction: Web visits
• For our clients in the retail vertical, we provide chat
agents who are experienced in providing differentiated
support
• The differentiation is based on:
• Current phase of the journey
• Specific persona of the visitor
• We use ML models to compute probabilities of various
intents, and use them to provide customized intervention
for sales and service journeys
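As a rough illustration of the last bullet, the sketch below turns per-intent scores into probabilities with a softmax and picks an intervention from them. The intent names, scores, threshold and intervention labels are all hypothetical; the deck does not describe the actual models or decision rules.

```python
import math

def softmax(scores):
    """Convert raw per-intent scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {intent: math.exp(s - peak) for intent, s in scores.items()}
    total = sum(exps.values())
    return {intent: e / total for intent, e in exps.items()}

# Hypothetical mapping from intent to intervention.
INTERVENTIONS = {
    "purchase": "offer_sales_chat",
    "service_issue": "offer_service_chat",
}

def choose_intervention(probabilities, threshold=0.6):
    """Intervene only when the most likely intent is confident enough."""
    intent, p = max(probabilities.items(), key=lambda kv: kv[1])
    if p < threshold:
        return "no_intervention"
    return INTERVENTIONS.get(intent, "no_intervention")

# Hypothetical scores for one web visit.
scores = {"purchase": 2.0, "service_issue": 0.5, "browse": 0.1}
probabilities = softmax(scores)
intervention = choose_intervention(probabilities)
```

With these made-up scores the "purchase" intent dominates, so the visitor would be offered a sales chat; a flatter distribution would fall below the threshold and trigger no intervention.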
Use case of intent prediction: IVR Calls
• For our clients in banking, our IVR platform provides
self-service options for service journeys
• The challenge is to resolve the issues faced by the
customer within the IVR platform itself
• One of our flagship offerings is our natural language
understanding engine for free-flowing responses
• Again, we use ML models to compute probabilities of
various intents from the response, and use them to
provide a specific service or transfer to a voice agent
along with context
Use case of intent prediction: within Chat
• An emerging use case is deploying AI-assisted Virtual
Agents (chatbots) for various enterprise use cases
• The challenges here are:
• To detect intent from natural language texts, and then provide
natural language response – essentially continue a natural
conversation
• To be able to bring in human agents when the conversation goes
out-of-scope for the VA.
• We are using ML models to detect intent and state from
the conversation and take appropriate action
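The hand-off challenge described above can be sketched as a simple rule: let the virtual agent respond only when the detected intent is one it supports and the model is confident, and bring in a human agent otherwise. The function name, intent labels and confidence threshold below are assumptions made for illustration, not the production logic.

```python
def next_action(intent_probabilities, supported_intents, min_confidence=0.5):
    """Decide whether the virtual agent (VA) answers or hands off.

    intent_probabilities: dict of intent name -> model probability
    supported_intents: set of intents the VA is scripted to handle
    """
    intent = max(intent_probabilities, key=intent_probabilities.get)
    confidence = intent_probabilities[intent]
    if intent not in supported_intents or confidence < min_confidence:
        # Out of scope or too uncertain: escalate to a human agent.
        return ("handoff_to_human", intent)
    return ("respond", intent)

# Hypothetical VA that only handles two intents.
SUPPORTED = {"check_balance", "reset_pin"}

in_scope = next_action({"reset_pin": 0.82, "close_account": 0.18}, SUPPORTED)
out_of_scope = next_action({"close_account": 0.70, "reset_pin": 0.30}, SUPPORTED)
uncertain = next_action({"reset_pin": 0.40, "check_balance": 0.35, "other": 0.25}, SUPPORTED)
```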
Technology and Model Management at [24]7
High Level Architecture
Events
Real Time Platform
Batch Data Platform
Events
Reporting and BI
Predictions
Models
[24]7 Big Data Platform: Technologies
• We use multiple open-source technologies to power our platform.
Some of the technologies in use:
• Real Time Platform
• Apache Cassandra ring [http://cassandra.apache.org/]
• Jetty server for execution [http://www.eclipse.org/jetty/]
• Batch Data Platform
• Apache Hadoop [http://hadoop.apache.org/]
• Apache Hive [http://hive.apache.org/]
• Apache Spark [http://spark.apache.org/] [Upcoming]
• Others
• Apache Kafka [http://kafka.apache.org/]
• Apache Avro [http://avro.apache.org/]
• HP Vertica database [http://www.vertica.com/]
• Apache Pig [http://pig.apache.org/]
• Apache Druid [https://druid.apache.org/]
Architecture for model building
Events
Batch Data Platform
HDFS
Nightly MR Jobs
Structured Datamart
Regular Model Building
Model Management Platform
Analytics & Monitoring
Retraining
R&D Model Building
Deploy Trained Model
Model building workflow
Sign Contract
Data Requirement Gathering
Data Capture
Exploratory Data Analysis
Model Building
Simulation
Model Deployment
Monitoring and Retraining
Platform for Model Management – Why?
• Prediction models are one of the key pieces for achieving
targets set in the contract; however, they are part of a larger
workflow that needs standardization
• Standard transformations: We now support a set of standard
transformations, coded in the same standard way in any model
• Standard libraries: Different libraries in different software
ecosystems (e.g., R, Python, Spark ML etc.) produce slightly
different results. With this platform, we can compare models, or
select one runtime to deploy models
• Skill can become an issue when working on prediction models
for various clients; the platform takes skill out of the equation by
providing templates encoding best practices
Spark Stack for Model Management
Early Iteration for Model Management Platform
• The model management platform was originally built on top
of Vertica
• Vertica from HP (now Micro Focus) is a columnar database with
strong analytical query capabilities
• We loaded the output of MR jobs into Vertica, which acted as our
datamart
• The model training workflow was managed by Oozie
• The actual training of models was performed in the Vertica
cluster using Vertica UDFs written in C++ and R
Early Iteration for Model Management Platform
Events
Batch Data Platform
HDFS
Nightly MR Jobs
Structured Datamart: Vertica
Regular Model Building
Model Management Platform: Vertica UDFs + Oozie workflows
Analytics & Monitoring
Retraining
R&D Model Building
Deploy Trained Model
Pros and Cons of using Vertica
• Pros
• All the EDA and computations happened in-database, so there was no
substantial data movement for model building
• Vertica supports SQL and R, resulting in easy onboarding for analysts
and data scientists
• Custom code for feature engineering from existing columns
• Cons
• Speed of computation was limited by the size of the Vertica cluster
• R UDFs cannot be parallelized, limiting the amount of distributed
computation that can be done while training complex models
• In some cases, it was hard to maintain or find R libraries compatible with Vertica
• Compatibility issues in general
• Small community of developers
• Cumbersome model deployment
• License requirement, versus an existing Spark cluster
Moving to Spark
• Spark is a strong distributed computation engine with a huge
community supporting it
• It is general purpose, helping to deploy scripts/code for data
preparation as well as monitoring
• SparkML has matured, with lots of features, quick bug fixes
and (again) an active community
• We wanted to expand model building to more use cases, and
the required data was already available in HDFS
• Spark models can be deployed directly on our production JVM
stack
• We already had a Spark cluster, which was being used for
ad-hoc queries
• Eliminates the need for specific feature engineering by using
the hashing trick
Model Management Platform with Spark
Events
Batch Data Platform
HDFS
Structured Datamart: HDFS/Vertica
Regular Model Building
Model Management Platform: Spark Cluster
Analytics & Monitoring
Retraining
R&D Model Building
Deploy Trained Model
Nightly Jobs (MR+Spark)
Developing the Framework
• The framework is a wrapper around Spark libraries,
developed in-house in Scala
• Has specialized modules to manage:
• Provision for config reading and validation
• Provision for reading data from HDFS (through Hive) and Vertica
• Provision for output (models) to be available both as bytecode
and in other (legacy) formats
• Provision for supporting custom model training workflows,
including post-processing
• API for accessing individual functionality
• Needed around 8-9 man-months to complete the project
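To make the "config reading and validation" module concrete, here is a minimal sketch of the idea. The real framework is written in Scala and its config format is not shown in the deck; the JSON layout, key names and allowed sources below are hypothetical, and Python is used only for illustration.

```python
import json

# Hypothetical schema: key name -> expected Python type.
REQUIRED_KEYS = {
    "source": str,        # where the training data is read from
    "table": str,         # table to read
    "label_column": str,  # column holding the target variable
    "num_features": int,  # size of the hashed feature space
}
ALLOWED_SOURCES = {"hive", "vertica"}

def load_config(text):
    """Parse a JSON training config and fail fast on invalid entries."""
    config = json.loads(text)
    errors = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            errors.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            errors.append(f"{key} must be of type {expected_type.__name__}")
    source = config.get("source")
    if isinstance(source, str) and source not in ALLOWED_SOURCES:
        errors.append(f"unsupported source: {source}")
    if errors:
        raise ValueError("; ".join(errors))
    return config

valid = load_config(
    '{"source": "hive", "table": "visits", '
    '"label_column": "converted", "num_features": 262144}'
)
```

Collecting all problems before raising means a misconfigured training job reports every issue in one pass rather than failing on the first.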
HashingTF in SparkML
• HashingTF is a way of automating feature engineering
from textual data using the hashing trick
• Essentially, using this method, one can project text onto a large
multidimensional space, thereby capturing nuanced features
UTF-8 Encoding
hashBytes
Byte to Int conversion
Multiply/Rotate/Add/Shift/XOR
Mixing Constants
Hashed Value
Index Scaling
Number of Features
TF Computation
HashingTF vector
Array of Features
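The steps in the diagram above can be sketched in plain Python: encode each token as UTF-8, hash the bytes with a multiply/rotate/XOR mixing function, scale the hash to an index below the number of features, and count term frequencies per index. This is a sketch of the technique using a standard MurmurHash3 (32-bit) implementation, not Spark's code; the seed of 42 and the default of 2^18 features are chosen to resemble Spark's defaults, and the indices are not guaranteed to match Spark's output.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF

def rotl32(x, r):
    """Rotate a 32-bit integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32

def murmur3_32(data, seed=0):
    """Standard MurmurHash3 (x86, 32-bit) over a bytes object."""
    c1, c2 = 0xCC9E2D51, 0x1B873593  # mixing constants
    h = seed & MASK32
    full = len(data) - len(data) % 4
    for i in range(0, full, 4):
        # Byte-to-int conversion: read 4 bytes as a little-endian integer.
        k = int.from_bytes(data[i:i + 4], "little")
        k = rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h = rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[full:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h ^= k
    # Final avalanche: shift/XOR/multiply rounds.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h

def hashing_tf(tokens, num_features=1 << 18, seed=42):
    """Map tokens to a sparse term-frequency vector {index: count}."""
    counts = Counter()
    for token in tokens:
        hashed = murmur3_32(token.encode("utf-8"), seed)
        if hashed >= 1 << 31:      # reinterpret as a signed 32-bit int
            hashed -= 1 << 32
        # Index scaling: Python's % is non-negative for a positive modulus.
        counts[hashed % num_features] += 1
    return dict(sorted(counts.items()))

vector = hashing_tf("reset my card pin reset".split(), num_features=16)
```

A small `num_features` such as 16 makes collisions likely; larger feature spaces trade memory for fewer collisions.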
Using HashingTF
• Using HashingTF can replace multiple preprocessing
steps for ML model training:
• Dealing with categorical variables
• Custom feature extraction (e.g., using regular expressions) from
text data
• Example: Categorizing URLs
• In our comparison experiments, we have noticed similar
or better results from models using HashingTF vs models
developed the traditional way
• The effect was more prominent when the original model included
multiple custom-created features from a large amount of text
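For the URL example, a sketch of the idea: instead of hand-writing one regular expression per URL category, split the URL into tokens and hash every token into a fixed-size count vector that a model can consume directly. The tokenization rule, the vector size of 32, and the use of `zlib.crc32` as a stand-in hash are all assumptions for illustration; Spark's HashingTF uses its own hash function.

```python
import re
import zlib
from urllib.parse import urlparse

def url_tokens(url):
    """Split the host and path of a URL into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    raw = re.split(r"[^a-z0-9]+", f"{parsed.netloc}{parsed.path}".lower())
    return [token for token in raw if token]

def url_features(url, num_features=32):
    """Hash URL tokens into a dense count vector of length num_features."""
    vector = [0] * num_features
    for token in url_tokens(url):
        # Stand-in hash for the sketch; any stable hash illustrates the trick.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector

# Hypothetical URL from a retail site.
tokens = url_tokens("https://shop.example.com/cart/checkout?step=2")
features = url_features("https://shop.example.com/cart/checkout?step=2")
```

New URL patterns need no code change under this scheme: unseen tokens simply land in some bucket of the same vector.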
Other Benefits of using Spark
• Model training is much faster compared to the
legacy method
• We are able to use distributed computation among the
nodes
• For a model trained on 1M rows, we see 2x-5x
improvement
• Innovative deployment of production models
• Uses a mix of JavaScript code and Java byte-serialized
code for a DAG of models
• Complex models in Spark format (byte-serialized) run
faster
• Faster cycle of model training, testing and
deployment, as the same underlying infrastructure is
used
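The "DAG of models" idea can be illustrated as follows: each model may consume the outputs of upstream models, so the models are evaluated in topological order. This is only a sketch in Python (3.9+ for `graphlib`); the production stack described above uses JavaScript and Java byte-serialized code, and the model names and scoring rules here are hypothetical.

```python
from graphlib import TopologicalSorter

def run_model_dag(models, dependencies, inputs):
    """Evaluate models in dependency order.

    models: dict of node name -> callable taking the results so far
    dependencies: dict of node name -> set of nodes it depends on
    inputs: dict of raw input values available before any model runs
    """
    results = dict(inputs)
    for name in TopologicalSorter(dependencies).static_order():
        if name in models:  # raw inputs appear in the order but have no model
            results[name] = models[name](results)
    return results

# Hypothetical two-stage DAG: an intent score feeds a chat-offer decision.
models = {
    "intent_score": lambda r: 0.8 if r["page"] == "cart" else 0.2,
    "chat_offer": lambda r: r["intent_score"] > 0.5,
}
dependencies = {
    "intent_score": {"page"},
    "chat_offer": {"intent_score"},
}

results = run_model_dag(models, dependencies, {"page": "cart"})
```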
Future work on the platform
• We are exploring training of other complex
models on the Spark platform
• Deep learning models for chatbot conversations
using MXNet
• We have worked on some innovations in
sampling, solving optimization problems and
training SVM models in the Spark library
• Would like to share those with the Spark community
Questions
Introduction to Structured streamingdatamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streamingdatamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scaladatamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scaladatamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2datamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0datamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetesdatamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsdatamantra
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scaladatamantra
 

More from datamantra (20)

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 

Recently uploaded

Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Recently uploaded (20)

Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Automating Prediction Model Life-cycles with Spark

  • 1. A Spark-stack for Automating Life-cycle of Prediction Models
      Samik Raychaudhuri, Ph.D., Director, Data Science Group, [24]7.ai Innovation Labs
      Bangalore Apache Spark Meetup, Nov 2017 (Monday, November 13, 2017)
      © 2017 24/7 Customer, Inc. All rights reserved.
  • 2. Agenda
      • Introduction
      • Use cases for ML models at [24]7
      • Model management at [24]7
      • The Spark Stack
      • Conclusion
  • 3. Prediction Models in [24]7.ai
  • 4. About [24]7
      • [24]7 is a software company based in the Bay Area, US, and Bangalore, India, delivering customer support solutions enhanced by predictive technologies
      • Using predictive models to drive enhanced customer experience is an emerging, niche application of analytics and big data
      • Our machine learning models on big data predict customer intent across various touchpoints in real time, helping us provide an intuitive experience when the customers (of our clients) contact us
  • 5. [24]7 - World’s Largest Self-Service Network
      • 2.5B digital interactions/year
      • 4.5TB interaction data/week
      • 90%+ CSAT across channels
      • 100M visitors/year
      • 1st true multi-modal solution
      • 1st omni-channel solution
      • We deliver a cloud-based software platform that uses predictive analytics and big data to make company-to-consumer connections intuitive.
  • 6. [24]7 iLabs: A Quick Snapshot
      (Diagram groupings: Self-service products, Assisted-service products, Solutions, Services)
      • Assist (for Chat): Smart chat platform for online and mobile engagement
      • Assist (for IVR): Call deflection to mobile web chat for higher NPS and ROI
      • Assist (for Voice): Smart voice agent platform for multi-modal engagement of voice callers
      • Predictive Sales: Drive higher incremental revenue and customer acquisition
      • Predictive Service: Reduce customer effort to increase CSAT and NPS in customer service
      • Chat Agents: Chat agent services that engage customers and help reduce costs, generate revenue, and improve CSAT
      • Voice Agents: Voice agent services that engage customers and help reduce costs, generate revenue, and improve CSAT
      • Social: Social sharing
      • Mobile: Mobile self-service
      • Vivid Speech: Mobile for IVR
      • Speech: Speech self-service IVR
  • 7. Data Science – What it means for [24]7
      • Customer Intent Engine (intent models): fn (customer type, location, identity, interaction context, journey, behavior …). Intent: purchase; issue with product or service, …
      • Engagement Engine (guided self-service, chat, phone): fn (identity, intent type, history, channel affinity, customer value …). Measure: usage, containment, repeat …
      • Outcomes: sales, resolution, experience, retention. Metrics: conversion rate, revenue, CSAT, …
      • Machine learning at scale, creating personalized, intuitive consumer experiences
  • 8. Big Data in [24]7
      (Figure: data sources and technologies)
  • 9. Use case of intent prediction: Web visits
      • For our clients in the retail vertical, we provide chat agents who are experienced in providing differentiated support
      • The differentiation is based on:
          • Current phase of the journey
          • Specific persona of the visitor
      • We use ML models to compute probabilities of various intents, and use them to provide customized intervention for sales and service journeys
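The last point on this slide, turning per-intent probabilities into an intervention decision, can be illustrated with a minimal sketch. The intent names, raw scores, and the 0.6 threshold below are hypothetical and are not taken from the talk; the slides do not say how [24]7 maps probabilities to actions.

```python
import math


def softmax(scores):
    """Convert raw per-intent scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {intent: math.exp(s - peak) for intent, s in scores.items()}
    total = sum(exps.values())
    return {intent: e / total for intent, e in exps.items()}


def choose_intervention(probabilities, threshold=0.6):
    """Return the most likely intent if it clears the threshold, else None."""
    intent, p = max(probabilities.items(), key=lambda kv: kv[1])
    return intent if p >= threshold else None


# Hypothetical raw model scores for a single web visit.
raw_scores = {"purchase": 2.0, "service_issue": 0.5, "browse": 0.1}
probs = softmax(raw_scores)
decision = choose_intervention(probs)
print(decision)
```

Returning `None` when no intent is confident enough leaves room for a default experience instead of forcing an intervention.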
  • 10. Use case of intent prediction: IVR Calls
      • For our clients in banking, our IVR platform provides self-service options for service journeys
      • The challenge is to resolve the issues faced by the customer within the IVR platform itself
      • One of our flagship offerings is our natural language understanding engine for free-flowing responses
      • Again, we use ML models to compute probabilities of various intents from the response, and use them to provide a specific service or transfer to a voice agent along with context
  • 11. Use case of intent prediction: within Chat
      • An emerging use case is deploying AI-assisted Virtual Agents (chatbots) for various enterprise use cases
      • The challenges here are:
          • To detect intent from natural language text and then provide a natural language response, essentially continuing a natural conversation
          • To be able to bring in human agents when the conversation goes out-of-scope for the VA
      • We are using ML models to detect intent and state from the conversation and take appropriate action
  • 12. Technology and Model Management at [24]7
  • 13. High Level Architecture
      (Diagram components: Events, Real Time Platform, Batch Data Platform, Reporting and BI, Predictions, Models)
  • 14. [24]7 Big Data Platform: Technologies
      • We use multiple open-source technologies to power our platform. Some of the technologies in use:
      • Real Time Platform
          • Apache Cassandra ring [http://cassandra.apache.org/]
          • Jetty server for execution [http://www.eclipse.org/jetty/]
      • Batch Data Platform
          • Apache Hadoop [http://hadoop.apache.org/]
          • Apache Hive [http://hive.apache.org/]
          • Apache Spark [http://spark.apache.org/] [Upcoming]
      • Others
          • Apache Kafka [http://kafka.apache.org/]
          • Apache Avro [http://avro.apache.org/]
          • HP Vertica database [http://www.vertica.com/]
          • Apache Pig [http://pig.apache.org/]
          • Apache Druid [https://druid.apache.org/]
  • 15. Architecture for model building
      (Diagram components: Events, Batch Data Platform, HDFS, Nightly MR Jobs, Structured Datamart, Regular Model Building, Model Management Platform, Analytics & Monitoring, Retraining, R&D Model Building, Deploy Trained Model)
  • 16. Model building workflow
      (Workflow steps, in order)
      • Sign Contract
      • Data Requirement Gathering
      • Data Capture
      • Exploratory Data Analysis
      • Model Building
      • Simulation
      • Model Deployment
      • Monitoring and Retraining
  • 17. Platform for Model Management – Why?
      • Prediction models are one of the key pieces for achieving the targets set in the contract, but they are part of a larger workflow that needs standardization
      • Standard transformations: we now support a set of standard transformations, coded the same way in every model
      • Standard libraries: different libraries in different software ecosystems (e.g., R, Python, Spark ML) produce slightly different results. With this platform, we can compare models, or select one runtime to deploy models
      • Skill can become an issue when working on prediction models for various clients; the platform takes skill out of the equation by providing templates that encode best practices
  • 18. Spark Stack for Model Management
  • 19. Early Iteration for Model Management Platform
      • The model management platform was originally built on top of Vertica
      • Vertica from HP (now MicroFocus) is a columnar database with strong analytical query capabilities
      • We loaded the output of MR jobs into Vertica, which acted as our datamart
      • The model training workflow was managed by Oozie
      • The actual training of models was performed in the Vertica cluster using Vertica UDFs written in C++ and R
  • 20. Early Iteration for Model Management Platform
      (Diagram components: Events, Batch Data Platform, HDFS, Nightly MR Jobs, Structured Datamart: Vertica, Regular Model Building, Model Management Platform: Vertica UDFs + Oozie workflows, Analytics & Monitoring, Retraining, R&D Model Building, Deploy Trained Model)
  • 21. Pros and Cons of using Vertica
      • Pros
          • All the EDA and computations happened in-database, so there was no substantial data movement for model building
          • Vertica supports SQL and R, resulting in easy onboarding for analysts and data scientists
          • Custom code for feature engineering from existing columns
      • Cons
          • Speed of computation was limited by the size of the Vertica cluster
          • R UDFs cannot be parallelized, limiting the amount of distributed computation that can be done while training complex models
          • In some cases, hard to maintain or find R libraries compatible with Vertica
          • Compatibility issues in general
          • Small community of developers
          • Cumbersome model deployment
          • License requirement vs existing Spark cluster
  • 22. Moving to Spark
      • Spark is a strong distributed computation engine with a huge community supporting it
      • It is general purpose, helping to deploy scripts/code for data preparation as well as monitoring
      • SparkML has matured, with lots of features, quick bug fixes and (again) an active community
      • We wanted to expand model building to more use cases, and the required data were already available in HDFS
      • Spark models can be directly deployed on our production JVM stack
      • We already had a Spark cluster which was being used for ad-hoc queries
      • Eliminates the need for specific feature engineering by using the hashing trick
  • 23. Model Management Platform with Spark
      (Diagram components: Events, Batch Data Platform, HDFS, Nightly Jobs (MR+Spark), Structured Datamart: HDFS/Vertica, Regular Model Building, Model Management Platform: Spark Cluster, Analytics & Monitoring, Retraining, R&D Model Building, Deploy Trained Model)
  • 24. Developing the Framework
      • The framework is a wrapper around Spark libraries, developed in-house in Scala
      • Has specialized modules to manage:
          • Provision for config reading and validation
          • Provision for reading data from HDFS (through Hive) and Vertica
          • Provision for output (models) to be available as both bytecode and as other (legacy) formats
          • Provision for supporting custom model training workflows including post-processing
          • API for accessing individual functionality
      • Needed around 8-9 man-months to complete the project
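As an illustration of the "config reading and validation" module mentioned on this slide, here is a minimal sketch. The in-house framework is written in Scala and its config format is not shown in the talk, so the JSON format, the key names (`model_name`, `input_table`, `label_column`, `num_features`) and the rules below are all hypothetical.

```python
import json

# Hypothetical schema: config key -> expected Python type.
REQUIRED_KEYS = {
    "model_name": str,
    "input_table": str,
    "label_column": str,
    "num_features": int,
}


def validate_config(raw_json):
    """Parse a JSON training config and return (config, list of error strings)."""
    config = json.loads(raw_json)
    errors = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in config:
            errors.append(f"missing key: {key}")
        elif not isinstance(config[key], expected):
            errors.append(f"{key} should be {expected.__name__}")
    # Example of a value-level rule on top of the type checks.
    if isinstance(config.get("num_features"), int) and config["num_features"] <= 0:
        errors.append("num_features must be positive")
    return config, errors


good = '{"model_name": "m1", "input_table": "t", "label_column": "y", "num_features": 1024}'
bad = '{"model_name": "m1", "input_table": "t", "num_features": 0}'
_, good_errors = validate_config(good)
_, bad_errors = validate_config(bad)
print(good_errors, bad_errors)
```

Collecting all errors before failing, rather than stopping at the first one, makes a rejected config easier to fix in one pass.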
  • 25. HashingTF in SparkML
      • HashingTF is a way of automated feature engineering from textual data using the hashing trick
      • Essentially, using this method, one can project text to a large multidimensional space, thereby capturing nuanced features
      (Diagram, processing flow: Array of Features; UTF-8 Encoding; hashBytes: Byte to Int conversion, Multiply/Rotate/Add/Shift/XOR with Mixing Constants; Hashed Value; Index Scaling by Number of Features; TF Computation; HashingTF vector)
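The flow on this slide (UTF-8 encoding, byte-to-int conversion, multiply/rotate/XOR mixing, index scaling, term-frequency counting) can be sketched in plain Python. This is a standard MurmurHash3 (x86, 32-bit) written for illustration, not Spark's own code, so it is not claimed to reproduce the exact indices that a given Spark version's HashingTF emits. The seed of 42 and the sparse dict output are assumptions of this sketch.

```python
def _rotl32(x, r):
    """Rotate a 32-bit unsigned integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data, seed=0):
    """Standard MurmurHash3 x86 32-bit over a bytes object."""
    c1, c2 = 0xCC9E2D51, 0x1B873593  # mixing constants
    h = seed & 0xFFFFFFFF
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        # Byte to int conversion: 4 bytes, little-endian.
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    tail = data[4 * n_blocks:]
    if tail:  # 1 to 3 leftover bytes
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization: fold in the length, then shift/XOR/multiply avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def hashing_tf(tokens, num_features=1 << 18, seed=42):
    """Map tokens to a sparse term-frequency vector {index: count}."""
    vector = {}
    for token in tokens:
        h = murmur3_32(token.encode("utf-8"), seed)
        signed = h - (1 << 32) if h >= (1 << 31) else h  # view as a signed 32-bit int
        index = signed % num_features  # Python's % is non-negative for a positive modulus
        vector[index] = vector.get(index, 0) + 1
    return vector


tokens = "i want to cancel my order i want a refund".split()
tf = hashing_tf(tokens, num_features=64)
print(tf)
```

Because no vocabulary is stored, the vector size is fixed up front; a smaller `num_features` saves memory at the cost of more collisions between unrelated tokens.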
  • 26. Using HashingTF
      • Using HashingTF can replace multiple preprocessing steps for ML model training:
          • Dealing with categorical variables
          • Custom feature extraction (e.g., using regular expressions) from text data
          • Example: categorizing URLs
      • In our comparison experiments, we have noticed similar or better results from models using HashingTF vs models developed the traditional way
      • The effect was more prominent when the original model included multiple custom-created features from a large amount of text
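To make the URL example concrete, the sketch below hashes URL tokens and a categorical value into one feature vector instead of writing a regular expression per URL category. It uses `zlib.crc32` as a stand-in hash purely to keep the example short (Spark's HashingTF uses a MurmurHash3 variant); the URL, the `device` column and the `column=value` token convention are hypothetical.

```python
import re
import zlib


def hashed_features(tokens, num_features=1024):
    """Hash each token to an index and count occurrences (sparse vector as a dict)."""
    vector = {}
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0) + 1
    return vector


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens, with no per-category regex."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def categorical_token(column, value):
    """Encode a categorical value as a token so it shares the same hashed space."""
    return f"{column}={value}"


url = "https://shop.example.com/cart/checkout?step=2"
tokens = url_tokens(url) + [categorical_token("device", "mobile")]
features = hashed_features(tokens)
print(tokens)
print(features)
```

New URL patterns or unseen category values need no code change: they simply hash to some index, which is what lets hashing stand in for hand-maintained extraction rules.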
  • 27. Other Benefits of using Spark
      • Model training is much faster compared to the legacy method
          • We are able to use distributed computation among the nodes
          • For a model trained on 1M rows, we see a 2x-5x improvement
      • Innovative deployment of production models
          • Uses a mix of JavaScript code and Java byte-serialized code for a DAG of models
          • Complex models in Spark format (byte-serialized) run faster
      • Faster cycle of model training, testing and deployment, as the same underlying infrastructure is used
  • 28. Future work on the platform
      • We are exploring training of other complex models on the Spark platform
          • Deep learning models for chatbot conversations using MXNet
      • We have worked on some innovations in sampling, solving optimization problems and training SVM models in the Spark library
          • Would like to share those with the Spark community
  • 29. Questions