Spark machine learning predicting customer churn
Carol McDonald
Using Spark Machine learning to predict customer churn
1.
Spark Machine Learning
Carol McDonald, @caroljmcdonald
© 2017 MapR Technologies
2.
Agenda
• Introduction to Machine Learning Techniques
  – Classification
  – Clustering
• Use Decision Tree to Predict Customer Churn
3.
What is Machine Learning?
[Diagram: Data (contains patterns) → Train Algorithm (finds patterns) → Build Model. New Data → Use Model (prediction function; recognizes patterns) → Predictions.]
4.
Examples of ML Algorithms
Machine Learning: Supervised
• Classification
  – Naïve Bayes
  – SVM
  – Random Decision Forests
• Regression
  – Linear
  – Logistic
Machine Learning: Unsupervised
• Clustering
  – K-means
• Dimensionality reduction
  – Principal Component Analysis
  – SVD
5.
Supervised Algorithms Use Labeled Data
[Diagram: Data (features) → Build Model. New Data (features) → Use Model → Predict.]
6.
Supervised Machine Learning: Classification & Regression
Classification: identifies the category for an item
7.
Classification: Definition
Form of ML that:
• Identifies which category an item belongs to
• Uses supervised learning algorithms
  – Data is labeled
[Figure label: Sentiment]
8.
If It Walks/Swims/Quacks Like a Duck ... Then It Must Be a Duck
Features: walks, swims, quacks
9.
Car Insurance Fraud Example
• What are we trying to predict?
  – This is the Label or Target outcome:
  – The amount of fraud
• What are the "if questions" or properties we can use to predict?
  – These are the Features:
  – The claim amount
10.
Car Insurance Fraud Regression Example
[Plot: Y axis = Label: amount of fraud; X axis = Feature: claimed amount; each data point = (fraud amount, claimed amount)]
AmntFraud = intercept + coeff * claimedAmnt
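The line AmntFraud = intercept + coeff * claimedAmnt can be fitted by ordinary least squares. The following is a minimal sketch in plain Scala (no Spark), assuming a single feature; the object name FraudRegression and the sample numbers in the usage note are illustrative and not taken from the deck.

```scala
object FraudRegression {
  // Ordinary least squares for one feature: y = intercept + coeff * x.
  // Returns (intercept, coeff).
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.size == ys.size && xs.size >= 2, "need at least two paired points")
    val n = xs.size.toDouble
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Covariance-like and variance-like sums (unnormalized).
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(sxx != 0.0, "feature values must not all be equal")
    val coeff = sxy / sxx
    val intercept = meanY - coeff * meanX
    (intercept, coeff)
  }

  // Apply the fitted line to a new claimed amount.
  def predict(intercept: Double, coeff: Double, claimedAmnt: Double): Double =
    intercept + coeff * claimedAmnt
}
```

For example, fitting claimed amounts 1000, 2000, 3000, 4000 against fraud amounts 150, 250, 350, 450 recovers an intercept of about 50 and a coefficient of about 0.1, since those made-up points lie exactly on that line.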
11.
Credit Card Fraud Example
• What are we trying to predict?
  – This is the Label:
  – The probability of fraud
• What are the "if questions" or properties we can use to predict?
  – These are the Features:
  – transaction amount, type of merchant, distance from and time since last transaction
12.
Credit Card Fraud Logistic Regression Example
[Plot: Y axis = Label: probability of fraud, from 0 (Not Fraud) to 1 (Fraud), with .5 marked; X axis = Features: transaction amount, type of store, time and location difference since last transaction]
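Logistic regression turns a weighted sum of the features into a probability between 0 and 1 with the sigmoid function, and the .5 mark on the plot is a natural cut-off between Not Fraud and Fraud. A minimal sketch in plain Scala, assuming the weights and bias have already been learned; the names FraudLogistic, probability, and isFraud are illustrative, not from the deck.

```scala
object FraudLogistic {
  // Sigmoid squashes any real number into the open interval (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability of fraud for one transaction, given learned weights and bias.
  def probability(weights: Seq[Double], bias: Double, features: Seq[Double]): Double = {
    require(weights.size == features.size, "one weight per feature")
    val z = bias + weights.zip(features).map { case (w, x) => w * x }.sum
    sigmoid(z)
  }

  // Classify with a threshold; 0.5 is an assumed default, not a fixed rule.
  def isFraud(p: Double, threshold: Double = 0.5): Boolean = p >= threshold
}
```

In practice the threshold is a tuning choice: lowering it flags more transactions as fraud at the cost of more false alarms.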
13.
Supervised Learning: Classification & Regression
• Classification:
  – identifies which category (e.g. fraud or not fraud)
• Linear Regression:
  – predicts a value (e.g. amount of fraud)
• Logistic Regression:
  – predicts a probability (e.g. probability of fraud)
14.
Examples of ML Algorithms
Machine Learning: Unsupervised
• Clustering
  – K-means
• Dimensionality reduction
  – Principal Component Analysis
  – SVD
Machine Learning: Supervised
• Classification
  – Naïve Bayes
  – SVM
  – Random Decision Forests
• Regression
  – Linear
  – Logistic
15.
Unsupervised Algorithms Use Unlabeled Data
[Diagram: Customer purchase data (contains patterns) → Train Algorithm (finds patterns) → Build Model → Customer Groups. New Customer Purchase Data → Use Model (prediction function; recognizes patterns) → Predict Group.]
16.
Unsupervised Machine Learning: Clustering
Clustering: group news articles into different categories
17.
Clustering: Definition
• Unsupervised learning task
• Groups objects into clusters of high similarity
18.
Clustering: Definition
• Unsupervised learning task
• Groups objects into clusters of high similarity
  – Search results grouping
  – Grouping of customers
  – Anomaly detection
  – Text categorization
19.
Clustering: Example
• Group similar objects
20.
Clustering: Example
• Group similar objects
• Use MLlib K-means algorithm
  1. Initialize the coordinates of the cluster centers (centroids)
21.
Clustering: Example
• Group similar objects
• Use MLlib K-means algorithm
  1. Initialize the coordinates of the cluster centers (centroids)
  2. Assign all points to the nearest centroid
22.
Clustering: Example
• Group similar objects
• Use MLlib K-means algorithm
  1. Initialize the coordinates of the cluster centers (centroids)
  2. Assign all points to the nearest centroid
  3. Update each centroid to the center of its assigned points
23.
Clustering: Example
• Group similar objects
• Use MLlib K-means algorithm
  1. Initialize the coordinates of the cluster centers (centroids)
  2. Assign all points to the nearest centroid
  3. Update each centroid to the center of its assigned points
  4. Repeat until the stopping conditions are met
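The four K-means steps can be sketched in a few lines of plain Scala. This is not MLlib's implementation: it is a minimal illustration on 2-D points, assuming the caller supplies the initial centroids and that the loop stops when centroids no longer move or after maxIter rounds. The names KMeansSketch, run, and maxIter are illustrative.

```scala
object KMeansSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance between two 2-D points.
  def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Step 2: index of the centroid closest to p.
  def nearest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => dist2(p, centroids(i)))

  // Step 3: center (mean) of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  // Steps 1-4: start from the given centroids, then assign and update
  // until nothing changes or maxIter is reached.
  def run(points: Vector[Point], init: Vector[Point], maxIter: Int = 20): Vector[Point] = {
    var centroids = init
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      val groups = points.groupBy(p => nearest(p, centroids))
      // A centroid with no assigned points keeps its previous position.
      val updated = centroids.indices
        .map(i => groups.get(i).map(mean).getOrElse(centroids(i)))
        .toVector
      changed = updated != centroids
      centroids = updated
      iter += 1
    }
    centroids
  }
}
```

With points (0,0), (0,2), (10,10), (10,12) and initial centroids (0,0) and (10,10), the loop settles on centroids (0,1) and (10,11) after one update.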
24.
Predict Churn
25.
Churn Modelling
[Diagram: ML workflow. Data Discovery, Model Creation: Historical Data (Customer Data, Call Center Records, Web Clickstream, Server Logs) → Feature Extraction → Training Set and Test Set → Model Training/Building → Test Model Predictions → Evaluate Results. Production: New Data → Feature Extraction → Deployed Model → Predictions.]
26.
Telecom Customer Churn Data
• State: string
• Account length: integer
• Area code: integer
• International plan: string
• Voice mail plan: string
• Number vmail messages: integer
• Total day minutes: double
• Total day calls: integer
• Total day charge: double
• Total eve minutes: double
• Total eve calls: integer
• Total eve charge: double
• Total night minutes: double
• Total night calls: integer
• Total night charge: double
• Total intl minutes: double
• Total intl calls: integer
• Total intl charge: double
• Customer service calls: integer
27.
Customer Churn Example
• What are we trying to predict?
  – This is the Label:
  – Did the customer churn? True or False
• What are the "if questions" or properties we can use to predict?
  – These are the Features:
  – Number of customer service calls, total day minutes …
28.
Decision Trees
• Decision Tree for Classification prediction
• Represents tree with nodes
• IF THEN ELSE questions using features at each node
• Answers branch to child nodes
[Diagram: decision tree. Root node: "If the number of customer service calls < 3". Child nodes: "If the total day minutes > 200" and "If the total day minutes < 200". True/False branches lead to leaves labeled Churned: T or Churned: F.]
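A decision tree of IF THEN ELSE questions maps naturally onto a small recursive data type. The sketch below is plain Scala, not Spark's DecisionTreeClassifier. The feature names numcs and tdmins follow the Account case class used later in the deck, but the tree shape and the leaf outcomes are hypothetical, chosen only to illustrate how answers branch to child nodes.

```scala
object ChurnTreeSketch {
  sealed trait Node
  // A leaf holds the predicted label.
  final case class Leaf(churned: Boolean) extends Node
  // An internal node asks "is feature < threshold?" and branches on the answer.
  final case class Split(feature: String, threshold: Double,
                         ifLess: Node, otherwise: Node) extends Node

  // Walk from the root to a leaf, answering one question per node.
  def predict(node: Node, row: Map[String, Double]): Boolean = node match {
    case Leaf(churned) => churned
    case Split(feature, threshold, ifLess, otherwise) =>
      if (row(feature) < threshold) predict(ifLess, row) else predict(otherwise, row)
  }

  // Hypothetical tree (thresholds 3 and 200 echo the slide; leaf labels are assumed).
  val tree: Node =
    Split("numcs", 3.0,
      ifLess = Split("tdmins", 200.0, ifLess = Leaf(false), otherwise = Leaf(true)),
      otherwise = Leaf(true))
}
```

A training algorithm such as the one behind Spark's decision tree learns the features, thresholds, and leaf labels from data instead of having them written by hand as here.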
29.
Example Decision Tree
30.
Spark ML workflow
31.
Spark ML workflow with a Pipeline
[Diagram: Training: Load Data → DataFrame → Extract Features (Transformer) → Train Model (Estimator), combined in a Pipeline (Estimator) → Cross Validate (Evaluator) → fit → Pipeline Model (Transformer). Testing: Load Data → Test DataFrame → Extract Features → Predict With Model → transform → Evaluate (Evaluator).]
32.
Zeppelin Notebook with Spark
[Figure labels: Data Engineer, Data Scientist]
33.
Load the data into a DataFrame: Define the Schema
case class Account(state: String, len: Integer, acode: String,
  intlplan: String, vplan: String, numvmail: Double,
  tdmins: Double, tdcalls: Double, tdcharge: Double,
  temins: Double, tecalls: Double, techarge: Double,
  tnmins: Double, tncalls: Double, tncharge: Double,
  timins: Double, ticalls: Double, ticharge: Double,
  numcs: Double, churn: String)
Input CSV file sample:
KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
34.
Load the data into a Dataset
[Diagram: Load data → DataFrame]
val train: Dataset[Account] = spark.read.option("inferSchema", "false")
  .schema(schema).csv("/user/user01/data/churn-bigml-80.csv").as[Account]
35.
Dataset merged with DataFrame
In Spark 2.0, the DataFrame API was merged with the Dataset API.
36.
Extract the Features
[Diagram (image reference: O'Reilly, Learning Spark): Training Data → Featurization → Feature Vectors and Label → Training → Model → Model Evaluation → Best Model. Example labels: Churned=T, Churned=F. Example features: number of customer service calls, number of day minutes.]
37.
Use StringIndexer to map Strings to Numbers
[Diagram: DataFrame → Add column → DataFrame]
val ipindexer = new StringIndexer()
  .setInputCol("intlplan")
  .setOutputCol("iplanIndex")
38.
Use StringIndexer to map churn True/False to Numbers
[Diagram: DataFrame → Add column → DataFrame]
val labelindexer = new StringIndexer()
  .setInputCol("churn")
  .setOutputCol("label")
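To make the StringIndexer steps concrete: by default it assigns indices by label frequency, so the most frequent string gets 0.0. The following plain-Scala sketch mimics that mapping on an in-memory column. It is not the Spark API, and the alphabetical tie-break for equally frequent labels is an assumption of this sketch; the names StringIndexSketch, fitIndex, and transform are illustrative.

```scala
object StringIndexSketch {
  // "Fit": count each distinct label and number the labels by descending frequency.
  // Ties are broken alphabetically here (an assumption of this sketch).
  def fitIndex(values: Seq[String]): Map[String, Double] =
    values.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .map { case (label, i) => label -> i.toDouble }
      .toMap

  // "Transform": replace each string with its numeric index.
  def transform(values: Seq[String], index: Map[String, Double]): Seq[Double] =
    values.map(index)
}
```

For a churn column holding False, False, True, False, the majority value False maps to 0.0 and True maps to 1.0, which is why the resulting label column can be read as "1.0 means churned" only when churners are the minority.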
39.
Use VectorAssembler to put features in a vector column
[Diagram: Load data → DataFrame → Add column → DataFrame + Features]
val featureCols = Array("temins", "iplanIndex", "tdmins", "tdcalls"…)
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
40.
Create DecisionTree Estimator, Set Label and Features
[Diagram: Load data → DataFrame → transform → DataFrame + Features → Estimator]
val dTree = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
41.
Put Feature Transformers and Estimator in Pipeline
val pipeline = new Pipeline()
  .setStages(Array(ipindexer, labelindexer, assembler, dTree))
[Diagram: Pipeline: ipindexer (feature transform) → labelindexer (feature transform) → assembler (assemble features) → dTree estimator (produce model)]
42.
Spark ML workflow with a Pipeline
[Diagram: Training: Load Data → DataFrame → Pipeline (Transformers: extract features; Estimator: train model) → fit → Pipeline Model, checked by an evaluator. Testing: Load Data → Test DataFrame → Pipeline Model (use fitted model) → transform → evaluator.]
43.
K-fold Cross-Validation Process
[Diagram: Data → Training Set → Model Training/Building; Test Set → Test Model Predictions]
Data is randomly split into K partitions, which are used to form K training and test dataset pairs.
44.
K-fold Cross-Validation Process
[Diagram: Data → Training Set → Model Training; Test Set → Test Model Predictions]
Train the algorithm with the training dataset.
45.
ML Cross-Validation Process
[Diagram: Data → Training Set → Model; Test Set → Test Model Predictions]
Evaluate the model with the test set.
46.
K-fold Cross-Validation Process
[Diagram: Data → Training Set → Model Training/Building; Test Set → Test Model Predictions; train/test loop runs K times]
Repeat the train/test loop K times, then select the model produced by the best-performing set of parameters.
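The K-fold splitting step can be sketched in plain Scala as follows. This is an illustration of the idea, not Spark's CrossValidator: records are shuffled with a fixed seed, dealt into K folds, and each fold serves once as the test set while the remaining folds form the training set. The names KFoldSketch, kFold, and seed are illustrative.

```scala
object KFoldSketch {
  // Returns K (train, test) pairs; every record appears in exactly one test set.
  def kFold[A](data: Vector[A], k: Int, seed: Long): Vector[(Vector[A], Vector[A])] = {
    require(k >= 2 && k <= data.size, "k must be between 2 and the number of records")
    // Shuffle deterministically so the split is random but reproducible.
    val shuffled = new scala.util.Random(seed).shuffle(data)
    // Deal records into folds round-robin by position.
    val folds = shuffled.zipWithIndex.groupBy { case (_, i) => i % k }
    (0 until k).toVector.map { f =>
      val test = folds(f).map(_._1)
      val train = (0 until k).filter(_ != f).flatMap(g => folds(g).map(_._1)).toVector
      (train, test)
    }
  }
}
```

A full cross-validation would train and score a model on each pair for every candidate parameter setting, average the K scores per setting, and keep the setting with the best average.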
47.
Cross Validation
[Diagram: transformation and estimation pipeline; Pipeline, Parameter Grid, and evaluator → Cross Validator → fit]
Set up a CrossValidator with:
• Parameter grid
• Estimator (pipeline)
• Evaluator
Perform grid-search-based model selection
48.
Parameter Tuning with CrossValidator with a Paramgrid
CrossValidator
• Given:
  – Estimator
  – Parameter grid
  – Evaluator
• Find best parameters and model
val paramGrid = new ParamGridBuilder()
  .addGrid(dTree.maxDepth, Array(2, 3, 4, 5, 6, 7)).build()
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("prediction")
val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
49.
Cross Validator: fit a model to the data
val cvModel = crossval.fit(ntrain)
[Diagram: Pipeline, Parameter Grid, and evaluator → Cross Validator → fit → Pipeline Model]
Fit a model to the data with the provided parameter grid
50.
Evaluate the fitted model
[Diagram: Training: Load Data → DataFrame → Pipeline (Transformers: extract features; Estimator: train model) → fit → Pipeline Model, checked by an evaluator. Testing: Load Data → Test DataFrame → Extract Features → Predict With Model → transform → evaluator.]
51.
Evaluate the Predictions from DecisionTree
[Diagram: Test features → fitted model (Estimator) → transform → Evaluator → evaluate prediction accuracy]
val predictions = cvModel.transform(test)
val accuracy = evaluator.evaluate(predictions)
52.
Area under the ROC curve
Model quality is measured here by the area under the ROC curve, which summarizes how well the predictions separate the two classes.
• An area of 1 represents a perfect test
• An area of .5 represents a worthless test
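One way to see why 1 is perfect and .5 is worthless: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Scala sketch below computes the area from that pairwise definition. It is not how Spark's BinaryClassificationEvaluator computes it internally, and the names AucSketch and auc are illustrative.

```scala
object AucSketch {
  // Each element is (model score, true label). Compares every positive with
  // every negative: a win counts 1, a tie counts 0.5, a loss counts 0.
  def auc(scored: Seq[(Double, Boolean)]): Double = {
    val pos = scored.collect { case (s, true) => s }
    val neg = scored.collect { case (s, false) => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one positive and one negative")
    val wins = (for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0).sum
    wins / (pos.size.toDouble * neg.size)
  }
}
```

This quadratic pairwise version is fine for small samples; for large datasets the usual approach is to sort by score once and accumulate the curve.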
53.
To Learn More:
• Read about and download example code
• https://mapr.com/blog/churn-prediction-sparkml/
54.
To Learn More:
• End to End Application for Monitoring Uber Data using Spark ML
• https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1/
55.
To Learn More:
• MapR Free ODT http://learn.mapr.com/
56.
For Q&A:
• https://community.mapr.com/
• https://community.mapr.com/community/answers/pages/qa
57.
MapR Converged Data Platform
[Diagram: platform layers. Top: Open Source Engines & Tools, Commercial Engines & Applications, Custom Apps, Cloud and Managed Services. Data Processing: Search and Others. Enterprise-Grade Platform Services: Web-Scale Storage (MapR-XD), Database (MapR-DB), Event Streaming (MapR Streams). APIs: HDFS API, POSIX, NFS, HBase API, OJAI API, Kafka API. Platform features: High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace, Unified Management and Monitoring.]
58.
Q&A
ENGAGE WITH US