Paul Lo, 2018/12 @ Taipei.py
Data Analytics @ Uber, Asia-Pacific Community Operation Central team
paullo0106@gmail.com | http://paullo.myvnc.com/blog/
Improving User Experience with Text Mining
and Deep Learning in Uber
Table of contents
Self-introduction
Who am I?
What does our analytics team do for Asia-Pacific?
Project #1
Text mining tool to unlock user insights
Python lib: natural language processing, topic modeling
Project #2
Deep learning-based answering bot for call center
Python lib: machine learning related, such as tensorflow, keras, sklearn, numpy, etc.
Self-introduction
Skills: Full stack software engineer (Java/ Python) → Data Analyst (Python/ R, databases, machine learning)
Self-introduction
Scope of Community Operation in Uber APAC
Scope
10+ languages in ~20 countries
Central Team
based in
Manila,
Singapore,
India
India
Singapore (South East and North Asia)
Australia
Data @ Uber
Uber’s Data Lake
Stores 30+ Petabytes of data
~M clusters across N data centers
(thousands of servers)
So how much data is that really?
~100,000 years of music
Which is 50x the amount of music streamed on Spotify
each year
50+ billion books or 50 million Kindles
Equivalent to the entire written works of mankind from
the beginning of recorded history, in all languages
150+ years of 24/7 Full HD video recording
The amount of storage required to render 50 Avatar
movies, simultaneously
How big is Big data?
Data-driven business decision culture
Data helps us tell our story to the public and operate better
Typical policy and communications questions:
● How many jobs does Uber provide in Taipei?
● How is Uber pool reducing congestion in Manila?
● What proportion of our trips start or end at public transportation?
** Uber's open-source city traffic data: https://movement.uber.com
Typical city operation questions:
● Do we have enough drivers for the New Year?
● How can we reduce the ETA for our riders?
● When is the best time to introduce an Eats delivery fee in my city?
Data tools to support Big data
Source: https://eng.uber.com
What is our role at Uber?
Uber’s Data Lake
App + Support Data:
Rides, Eats, etc.
Payments Data:
Collection, Payments
External Data:
Traffic, Weather,
Holidays, Maps
Machine learning
platform
Programming
interface
Query interface
Internal BI Tools
Company-wide
dashboards
Marketing Data:
Clicks, Impressions,
Sentiment
Improving user experience is one of our core missions
Improve user experience
Drive down defect rate
Optimize operational efficiency
Manage the cost of business operation
Project #1:
Text mining and NLP for user experience
enhancement
Acknowledgement: Troy James Palanca, Lorenzo Ampil
Value proposition
Speed up the workflow on user experience enhancement
Defect rate and issue type leaderboard
Community Operation
Product, Engineering, etc.
User feedback database
Root cause analysis and recommended feature or policy changes
Review customer feedback in tickets
User experience enhancement
Making this process more efficient
Problem
How can we quickly get the insights from users’ feedback?
Problem
Reviewing tickets manually to diagnose
the root cause is neither scalable
nor systematic
Ticket dataset
Driver > Trips > Fare … > … > Technical issue
[Figure: a pile of individual, ungrouped tickets]
Problem
How can we quickly get the insights from users’ feedback?
Solution
Use topic modeling
techniques to
efficiently group tickets
and assign them to
reasonably named
topics.
Ticket dataset
Driver > Trips > Fare … > … > Technical issue
App stuck/ crash
(35%)
Fare calculation
Dispute
(15%)
GPS issue
(55%)
Key features of our solution
Using a topic-modeling-based tool to learn our users' pain points
Ticket snippet with user profile: respective ticket
samples are displayed when clicking on a keyword
Word cloud view: user can switch to
this view to see most relevant (tf-idf
score) keywords in each topic
>>DEMO
Sample results
“Fare Disputes” in one of the cities where we operate are
mainly about payments, airport issues, and wrong
riders:
● Credit cards and other modes of payment
(18%)
● Overcharging (28.8%)
● Wrong profiles being billed (12.8%)
● Airport terminal issues (12.9%)
● Someone else taking the trip (12.5%)
Sample results
Lots of “rude”, “loud music”, “drunk”, and “slam door” keywords
were detected as the pain points of our NY driver partners
Sample results
More than 10% of driver cancellation
tickets in Singapore are related to car
seat rules for child safety: many
sample tickets show that drivers want
their cancellation fee reimbursed because
their riders brought children without prior
notice.
Tool architecture
Computing node
(any Uber servers)
Data collection
Data preparation
LDA model training
Web server
(AWS node)
HTML and JSON
files from
training results
User Interface
(D3.js)
Train the model for each country with top issues
monthly
Workflow overview
Data input: ticket text as raw
data
Output: topic model clusters
Unlocking support insights from textual content
Sample ~50,000 tickets for
each training in each issue
category
Workflow overview
Data Preparation
(text processing)
Extract useful information and
transform corpus to a sparse
matrix
Data Modeling
(Latent Dirichlet
Allocation)
Main computation to perform
topic modeling
Data input: ticket text as raw
data
Output: topic model clusters
Unlocking support insights from textual content
Sample ~50,000 tickets for
each training in each issue
category
Text processing library: nltk, BeautifulSoup, re, TextBlob
LDA library: gensim.models.ldamodel.LdaModel and pyLDAvis
Remove invalid words:
● Numbers
● Html tags
● Custom dictionary
Stemming and lemmatization
Tokenization
TFIDF (Term Frequency Inverse Document
Frequency)
Remove invalid words:
● Numbers
re.sub(r'\d+', '', text)
● HTML tags
BeautifulSoup(document).get_text()
BeautifulSoup(document).find_all('b')
● Custom dictionary
Stemming and lemmatization
Tokenization
TFIDF (Term Frequency Inverse Document
Frequency)
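The removal steps above can be sketched as follows. This is a minimal illustration rather than the tool's actual code: it uses only the standard library (`html.parser` stands in for BeautifulSoup so the sketch has no dependencies), and `CUSTOM_STOPWORDS`, `strip_html`, and `clean_ticket` are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Hypothetical custom dictionary of words to drop; a real one would be
# built for each issue category.
CUSTOM_STOPWORDS = {"uber", "hi", "thanks", "please"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(document):
    """Return the visible text of `document`, with tags removed."""
    extractor = _TextExtractor()
    extractor.feed(document)
    extractor.close()
    return " ".join(extractor.parts)


def clean_ticket(document):
    """Strip HTML tags, numbers, and custom stop words from a ticket."""
    text = strip_html(document)
    text = re.sub(r"\d+", "", text)  # remove numbers
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in CUSTOM_STOPWORDS]


print(clean_ticket("<p>Hi, my fare was <b>35</b> dollars too high</p>"))
```

Keeping the custom dictionary as a plain set makes it easy to maintain a separate word list per issue category.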
Remove invalid words
Stemming and lemmatization: Reduce inflectional
forms and sometimes derivationally related forms of a
word to a common base form. For instance:
○ cancel, cancels, cancelled -> cancel
○ riders, rider -> rider
Tokenization
TFIDF (Term Frequency Inverse Document
Frequency)
Remove invalid words
Stemming and lemmatization
Tokenization: Part-of-speech based word
detection
TFIDF (Term Frequency Inverse Document
Frequency)
Remove invalid words
Stemming and lemmatization
Tokenization: Part-of-speech based word
detection
TFIDF (Term Frequency Inverse Document
Frequency): a common way to score each term, weighting how often
it appears in a ticket against how common it is across all tickets
Data Preparation (Natural Language Processing)
Using TFIDF to filter the most important keywords
Machine Learning
Model
Term frequency
Inverse Document
Frequency
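To make the two factors concrete, here is a small pure-Python sketch of the textbook TF-IDF formula. It is an illustration under assumptions: the function name `tf_idf` and the sample tickets are made up, and libraries such as gensim or scikit-learn apply their own smoothing and normalization, so their scores will differ.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document.

    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = log(N / number of documents containing t)
    """
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up, already tokenized tickets.
docs = [
    ["driver", "cancel", "trip"],
    ["driver", "fare", "overcharge"],
    ["driver", "gps", "wrong", "route"],
]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

A term that appears in every ticket ("driver" here) gets an IDF of log(1) = 0, so it is filtered out as a keyword no matter how frequent it is.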
Workflow overview
Data preparation can be very time-consuming
Remove invalid words:
Stemming and lemmatization
Tokenization
TFIDF (Term Frequency Inverse Document
Frequency)
Speed up data processing
Pandas runs on a single thread by default
A pandas DataFrame with 50k+ rows
Data Preparation
text_processing() is a heavy function
that does many things:
● Tokenization
● Removal of numbers, html tags, and
other invalid words
● Stemming and lemmatization
● TFIDF
df['content'].apply(text_processing)
→ single thread by default
Speed up data processing
Pandas runs on a single thread by default
Worker 1
Worker 2
Worker N
keywords
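The slide's own code is not reproduced in this transcript, so the following is only a sketch of the worker idea: split the rows into chunks, process each chunk in a separate process, and stitch the results back together in order. It uses the standard library's `multiprocessing.Pool` on plain lists; `split_into_chunks`, `parallel_apply`, and the stand-in `text_processing` body are assumptions, not the talk's implementation.

```python
from multiprocessing import Pool


def split_into_chunks(items, n_chunks):
    """Split `items` into at most `n_chunks` contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def _process_chunk(args):
    """Apply `func` to every item of one chunk (runs inside a worker)."""
    func, chunk = args
    return [func(item) for item in chunk]


def parallel_apply(func, items, n_workers=4):
    """Apply `func` to each item, spreading chunks over worker processes.

    `func` must be a picklable, module-level function. With
    n_workers <= 1 the work runs serially in the current process.
    """
    items = list(items)
    if n_workers <= 1:
        return [func(item) for item in items]
    chunks = split_into_chunks(items, n_workers)
    with Pool(processes=n_workers) as pool:
        results = pool.map(_process_chunk, [(func, c) for c in chunks])
    # Flatten the per-chunk results back into one list, preserving order.
    return [value for chunk in results for value in chunk]


def text_processing(text):
    """Stand-in for the heavy per-ticket function from the slides."""
    return text.lower().split()
```

For a pandas column, something like `parallel_apply(text_processing, df['content'].tolist(), n_workers=4)` would replace `df['content'].apply(text_processing)`; when run as a script, the call should sit under an `if __name__ == "__main__":` guard so worker processes can import the module safely.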
Data processing speedup trick in Pandas
Pandas runs on a single thread by default
Many handy text processing libraries
TextBlob as an example
Tokenization
Sentence correction
.correct()
Part of speech
.tags
Sentiment analysis
.sentiment.polarity
NLP Library
(TextBlob)
(spaCy)
Workflow overview
Unlocking support insights from textual content - but how?
LDA:
- Unsupervised learning
- Bag of words
- “topic distribution”
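Usage:
from gensim.corpora import Dictionary
from gensim.models import LdaModel
# tokenized_docs (assumed name): list of token lists from the data preparation step
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
lda = LdaModel(corpus=corpus,
               id2word=dictionary,
               num_topics=4,
               random_state=some_number)  # some_number: any fixed integer seed
lda.show_topics()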
Latent Dirichlet Allocation model
General concept of this model
Unsupervised learning method - does not
require any class labels; similar to clustering
‘Bag of words’ model - uses word counts in
messages without regard for their order
(Peter owes Alice money = Alice owes Peter
money)
Estimated iteratively - Starts with random
initialization then adjusts probabilities to
reduce perplexity / increase fit
(EM; Expectation Maximization)
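The bag-of-words point can be shown in a few lines. This is a toy sketch with a hypothetical helper name, `bag_of_words`; it counts words and throws away their order, which is the property LDA relies on.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count each lowercase word, discarding word order."""
    return Counter(sentence.lower().split())


a = bag_of_words("Peter owes Alice money")
b = bag_of_words("Alice owes Peter money")
print(a == b)  # True: both sentences contain exactly the same words
```

Because the two sentences map to the same counts, a bag-of-words model cannot tell who owes whom; that is the price paid for a simple, sparse representation.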
[Figure: documents Doc 1, Doc 2, Doc 3, ... Doc n; each document gets document-topic probabilities, e.g. 30% health (topic 1), 60% fruits (topic 2), 10% disease (topic 3)]
Latent Dirichlet Allocation model
Model implementation and visualization
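Usage:
from gensim.models import LdaModel
from pyLDAvis import save_html
from pyLDAvis.gensim import prepare  # named pyLDAvis.gensim_models in newer pyLDAvis releases
lda = LdaModel(corpus=corpus,
               id2word=dictionary,
               num_topics=4,
               random_state=some_number)
lda.show_topics()
vis = prepare(lda, corpus, dictionary)
save_html(vis, 'lda.html')  # 'lda.html' is a placeholder output file name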
Future work and learnings
Data Preparation
(text processing)
Extract useful information and
transform corpus to a sparse
matrix
Data Modeling
(Latent Dirichlet
Allocation)
Main computation to perform
topic modeling
Customization is needed
● Not suited for
specific issue
category
● Build own
dictionary for the
removal of
irrelevant words
Data input: ticket text as raw
data
Output: topic model clusters
How to make the results more useful and actionable?
● Number of topics for convergence
● Time and performance
tradeoff
● Other ”Deep NLP” model ?
Word2vec
GloVe
Fasttext
Product owner: Huaixiu Zheng and Yichia Wang, Hugh Williams in Uber’s Applied Machine Learning team
Project #2:
Artificial Intelligence revolution in call centers
CSR’s sample workflow for responding to a user in a call center
How do our users submit an issue?
CSR’s sample workflow for responding to a user in a call center
Online support via in-app-help
The issue for call center operation: scalability and cost
The growth comes at a price again….
Solution? Let’s start from a basic sample
“I want to change my rating for a rider” - very rule-based deterministic flow
The business impact of a simple bot-solving solution
3k+ weekly solves
A team of
18 CSRs
28k USD
monthly
What’s the problem with this solution?
“Scalability”
The difference between Programming and Machine Learning
Our machine learning solution design
Why go with “semi-automated” assistance rather than a real robot?
Product designed by Hugh Williams, Huaixiu Zheng, Yi-Chia Wang in Applied Machine Learning team
Our machine learning solution design
‘Assistant to CSR’ - Provide suggestions for reply and actions
Issue category/ type suggestion
Action suggestion
10M+ tickets
Correct response from
agents to these 10M+
tickets
Technical model training Product design
Typical Machine Learning process
Note: picture from Mark Peng’s “General Tips for participating Kaggle Competitions” on SlideShare
Typical Machine Learning process
Model selection
ML 101:
Start with simple model first
Data source: https://eng.uber.com/cota-v2/
Deep Learning Architecture
Reference: Uber AML Lab: http://eng.uber.com/cota
1000+ multiclass problem
10M English tickets (10-day)
Deep Learning Architecture
Reference: Uber AML Lab: http://eng.uber.com/cota
Sample code with Keras for a simple CNN
Deep Learning Architecture
Reference: Uber AML Lab: http://eng.uber.com/cota
arXiv Paper: COTA: Improving the Speed and Accuracy of Customer Support through Ranking and Deep Networks (link)
CNN: Max pooling
Optimizers: Adam (SGD, RMSProp), Batch Normalization
Regularization: L2 Reg, Dropout, Batch Normalization, early stopping
Development environment for Deep learning model training
What does model training look like?
>> DEMO
Main codebase + data set
GRID K520
Feature engineering and feature importance
Trade off between capacity and interpretability
“Capacity”
“Interpretability”
Feature engineering and feature importance
What are the important features? Very easy to learn that in simpler models
Feature engineering and feature importance
What are the important features? Very easy to get explanation in simpler models
Feature engineering and feature importance
What are the important features? NN model is like our brain’s intuition … blackbox
Feature engineering process
What are the important features?
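Trick: the dataset is too large and re-running the model takes a long time →
shuffle the features in the "test data" one at a time to get results quickly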
Feature engineering and feature importance
What are the important features?
Sklearn: Recursive feature elimination
(sklearn.feature_selection.RFE)
Mockup
dataset
Feature engineering and feature importance
What are the important features?
Time on model training >>> prediction
Shuffle each feature to create noise…. on the testing set
Mockup
dataset
Feature engineering and feature importance
What are the important features?
Shuffle each feature to create noise…. on the testing set
Mockup
example
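The shuffling trick can be sketched without any ML library. Below, a model that is already trained is evaluated on the test set once as a baseline, then once per feature with that feature's column shuffled; the accuracy drop is the feature's importance. The model is never retrained. All names here (`accuracy`, `permutation_importance`, `toy_model`) and the mock-up data are assumptions for illustration only.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which model(row) equals the label."""
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(model, rows, labels, seed=0):
    """Accuracy drop on the test set after shuffling each feature column.

    Only predictions are recomputed; the model itself is left untouched.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for col in range(len(rows[0])):
        column = [row[col] for row in rows]
        rng.shuffle(column)
        shuffled = [row[:col] + [value] + row[col + 1:]
                    for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return drops


# Mock-up test set: the label depends only on feature 0.
test_rows = [[i % 2, (i * 7) % 5] for i in range(200)]
test_labels = [row[0] for row in test_rows]


def toy_model(row):
    """Hypothetical trained model that only looks at feature 0."""
    return row[0]


print(permutation_importance(toy_model, test_rows, test_labels))
```

Shuffling feature 1 leaves accuracy unchanged, so its importance is zero, while shuffling feature 0 breaks the predictions. Each importance costs one prediction pass over the test set, which is why this is attractive when training time far exceeds prediction time.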
Why is NumPy faster?
Python Vectorization: Single Instruction, Multiple Data (SIMD)
Why is NumPy faster?
Python Vectorization: Locality of reference (Spatial Locality)
Java/ C++ versus Python…...
Issue category suggestion
Action suggestion
Product design
Last stop: making business Impact
Ensure KPI measurement is well-planned in the beginning
Last stop: making business Impact
Identify key business metrics, and cautiously conduct and monitor experiments
Source: https://eng.uber.com/cota-v2/
Experiment notes:
* Network effect → switchback test instead of A/B test
* Guardrail variable and
decision variable (risk control)
* Monitoring versus peeking
* Novelty effect
Other learnings
How to become a better programmer, or data scientist?
● Long-term growth: Don't just know how to call APIs →
○ Understand what’s happening underneath (math and low-level
manipulation are key)
○ Understand the pros and cons of your tool/ model/ framework
choice
● Coding at scale: Resources and infra are rich, but the data is also
huge (as is the risk) → optimize for time and space,
but do not overdesign
● Communication: Everybody is busy → organize and
communicate your work well, and build good social relationships
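Recommended reading
How to become a better programmer, or data scientist? Read more, code more, share more
Data Science from Scratch (Chinese edition: 用python學資料科學)
Highly recommended; you can also try reading the original English edition
Python資料運算與分析實戰 (hands-on data computation and analysis with Python)
Numpy, Scipy, Pandas
Programming books by Japanese authors are also excellent...
Fluent Python (流暢的Python)
For Java, the bible I recommend is Effective Java
This one may not be quite at that level yet, but I recommend it too!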
Recommended reading
How to become a better programmer, or data scientist? Read & Code & Share, and repeat
Machine Learning and Deep Learning with Python
Focus on scikit-learn and TensorFlow
Data Science from Scratch
Highly recommended: Python-based, hands-on
On classical concepts and algorithms
Paul Lo
Data Analytics @ Uber
paul.lo *a*t uber.com | paullo0106 a*t* gmail.com
Q&A

Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 

[Taipei.py] improving user experience with text mining and deep learning in Uber

  • 1. Paul Lo, 2018/12 @ Taipei.py Data Analytics @ Uber, Asia-Pacific Community Operation Central team paullo0106@gmail.com | http://paullo.myvnc.com/blog/ Improving User Experience with Text Mining and Deep Learning in Uber
  • 2. Table of contents Self-introduction: Who am I? What does our analytics team do for Asia-Pacific? Project #1: Text mining tool to unlock user insights. Python libs: natural language processing, topic modeling. Project #2: Deep learning-based answering bot for call center. Python libs: machine learning related, such as tensorflow, keras, sklearn, numpy, etc.
  • 3. Self-introduction Skills: Full stack software engineer (Java/ Python) → Data Analyst (Python/ R, databases, machine learning)
  • 4. Scope of Community Operation in Uber APAC Scope 10+ languages in ~20 countries Central Team based in Manila, Singapore, India India Singapore (South East and North Asia) Australia
  • 5. Data @ Uber Uber’s Data Lake Stores 30+ Petabytes of data ~M clusters across N data centers (thousands of servers) So how much data is that really? ~100,000 years of music Which is 50x the amount of music streamed on Spotify each year 50+ billion books or 50 million Kindles Equivalent to the entire written works of mankind from the beginning of recorded history, in all languages 150+ years of 24/7 Full HD video recording The amount of storage required to render 50 Avatar movies, simultaneously How big is Big data?
  • 6. Data-driven business decision culture Data helps us tell the story to the public and operate better Typical policy and communications questions: ● How many jobs does Uber provide in Taipei? ● How is Uber pool reducing congestion in Manila? ● What proportion of our trips start or end at public transportation? ** Uber's open city traffic data: https://movement.uber.com Typical city operation questions: ● Do we have enough drivers for the New Year? ● How can we reduce the ETA for our riders? ● When is the best time to introduce an EATS delivery fee in my city?
  • 7. Data tools to support Big data Source: https://eng.uber.com
  • 8. What are our roles at Uber Uber’s Data Lake App + Support Data: Rides, Eats, etc. Payments Data: Collection, Payments External Data: Traffic, Weather, Holidays, Maps Machine learning platform Programming interface Query interface Internal BI Tools Company-wide dashboards Marketing Data: Clicks, Impressions, Sentiment
  • 9. Improving user experience is one of our core missions Improve user experience Drive down defect rate Optimize operational efficiency Manage the cost of business operation
  • 10. Project #1: Text mining and NLP for user experience enhancement Acknowledgement: Troy James Palanca, Lorenzo Ampil
  • 11. Value proposition Speed up the workflow on user experience enhancement Defect rate and issue type Leaderboard Community Operation Product, Engineering, etc. User feedback database Root cause analysis and recommended feature or policy changes Review customer feedback in tickets User experience enhancement
  • 12. Value proposition Speed up the workflow on user experience enhancement Defect rate and issue type Leaderboard Community Operation Product, Engineering, etc. User feedback database Root cause analysis and recommended feature or policy changes Review customer feedback in tickets User experience enhancement Making this process more efficient
  • 13. Problem How can we quickly get insights from users’ feedback? Problem: reviewing tickets manually to diagnose the root cause is unsystematic and does not scale. Ticket dataset: Driver > Trips > Fare … > … > Technical issue [figure: grid of individual tickets]
  • 14. Problem How can we quickly get insights from users’ feedback? Solution: use topic modeling techniques to efficiently group tickets and assign them to reasonably named topics. Ticket dataset: Driver > Trips > Fare … > … > Technical issue App stuck/ crash (35%) Fare calculation dispute (15%) GPS issue (55%)
  • 15. Key features of our solution Using a topic-modeling-based tool to learn pain points from our users Ticket snippet with user profile: the respective ticket samples are displayed when clicking on a keyword Word cloud view: users can switch to this view to see the most relevant keywords (by tf-idf score) in each topic >>DEMO
  • 16. Sample results “Fare Disputes” in one of the cities we operate in are mainly about payments, airport issues, and wrong riders: ● Credit cards and other modes of payment (18%) ● Overcharging (28.8%) ● Wrong profiles being billed (12.8%) ● Airport terminal issues (12.9%) ● Someone else taking the trip (12.5%)
  • 17. Sample results Lots of “rude”, “loud music”, “drunk”, and “slam door” keywords were detected as the pain points of our NY driver partners
  • 18. Sample results More than 10% of driver cancellation tickets in Singapore are related to car seat rules for child safety: many sample tickets show drivers asking for their cancellation fee to be reimbursed because riders brought children without prior notice.
  • 19. Tool architecture Computing node (any Uber server) Data collection Data preparation LDA model training Web server (AWS node) HTML and JSON files from training results User Interface (d3js) Train the model monthly for each country's top issues
  • 20. Workflow overview Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category
  • 21. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Text processing library: nltk, BeautifulSoup, re, TextBlob LDA library: gensim.ldamodel.LdaModel and pyLDAvis
  • 22. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words: ● Numbers ● Html tags ● Custom dictionary Stemming and lemmatization Tokenization TFIDF (Term Frequency Inverse Document Frequency)
  • 23. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words: ● Numbers re.sub(r'\d+', '', text) ● Html tags BeautifulSoup(document).get_text() BeautifulSoup(document).find_all('b') ● Custom dictionary Stemming and lemmatization Tokenization TFIDF (Term Frequency Inverse Document Frequency)
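The "remove invalid words" step on this slide can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: it uses the standard library's `html.parser` as a stand-in for `BeautifulSoup(document).get_text()`, and the names `strip_html`, `clean_ticket`, and `CUSTOM_STOPWORDS` (with its contents) are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(document):
    """Return the text of an HTML fragment with all tags removed."""
    parser = _TextExtractor()
    parser.feed(document)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical custom dictionary of words that carry no signal.
CUSTOM_STOPWORDS = {"uber", "hi", "thanks"}


def clean_ticket(document, stopwords=CUSTOM_STOPWORDS):
    """Strip tags, drop digits, lowercase, and filter custom stopwords."""
    text = strip_html(document)
    text = re.sub(r"\d+", "", text)  # remove numbers
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in stopwords]


print(clean_ticket("<p>Hi, I was charged <b>25</b> dollars twice</p>"))
```

In a real pipeline the custom dictionary would presumably be built per issue category, as the talk's "future work" slide suggests.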
  • 24. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words Stemming and lemmatization: Reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance: ○ cancel, cancels, cancelled -> cancel ○ riders, rider -> rider Tokenization TFIDF (Term Frequency Inverse Document Frequency)
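To make the idea of reducing words to a common base form concrete, here is a deliberately naive suffix stripper that handles only the slide's own examples. It is a toy, not a real stemming algorithm; the function name `toy_stem` and its rules are assumptions for illustration, and a real pipeline would rely on a library stemmer or lemmatizer such as those in nltk.

```python
def toy_stem(word):
    """Deliberately naive suffix stripper; not a real stemming algorithm."""
    word = word.lower()
    if word.endswith("ed") and len(word) > 4:
        word = word[:-2]
        # Undo consonant doubling: "cancell" -> "cancel".
        if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


for w in ["cancel", "cancels", "cancelled", "riders", "rider"]:
    print(w, "->", toy_stem(w))
```

Rules this crude will mangle many English words, which is one reason production code uses established stemmers instead.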
  • 25. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words Stemming and lemmatization Tokenization: Part-of-speech based word detection TFIDF (Term Frequency Inverse Document Frequency)
  • 26. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words Stemming and lemmatization Tokenization: Part-of-speech based word detection TFIDF (Term Frequency Inverse Document Frequency) Common practice to score each term with weighted frequency and relevance
  • 27. Data Preparation (Natural Language Processing) Using TFIDF to filter the most important keywords Machine Learning Model
  • 28. Data Preparation (Natural Language Processing) Using TFIDF to filter the most important keywords Machine Learning Model Term frequency Inverse Document Frequency
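The two factors named on this slide can be combined in a few lines of plain Python. This sketch assumes one common variant of the formula, tf(t, d) × log(N / df(t)), with no smoothing; libraries differ in the exact weighting, and the sample tickets below are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per doc."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # Term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Made-up tokenized tickets.
tickets = [
    ["driver", "cancel", "fee"],
    ["driver", "gps", "issue"],
    ["driver", "fare", "fee"],
]
for row in tf_idf(tickets):
    print(row)
```

A word such as "driver" that appears in every ticket gets a score of zero, while a word unique to one ticket scores highest, which is what makes the score useful for filtering keywords.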
  • 29. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Data preparation can be very time-consuming Sample ~50,000 tickets for each training in each issue category Remove invalid words: Stemming and lemmatization Tokenization TFIDF (Term Frequency Inverse Document Frequency)
  • 30. Speed up data processing Pandas runs on a single thread by default A pandas DataFrame with 50k+ rows Data Preparation text_processing() is a heavy function that does many things: ● Tokenization ● Removal of numbers, HTML tags, and other invalid words ● Stemming and lemmatization ● TFIDF df['content'].apply(text_processing) → single thread by default
  • 31. Speed up data processing Pandas runs on a single thread by default Worker 1 Worker 2 Worker N keywords
  • 32. Data processing speedup trick in Pandas Pandas runs on a single thread by default [code listing shown as an image on the slide; not captured in this transcript]
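The slide's own code is not in the transcript, so the following is only a guess at the general shape of such a trick: split the rows into chunks, process each chunk in a separate worker, and stitch the results back together in order. The helper names (`split_into_chunks`, `parallel_apply`) and the `executor_cls` parameter are assumptions, and the sketch works on a plain list rather than a DataFrame to stay dependency-free.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal pieces."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append(items[start:end])
        start = end
    return chunks


def _apply_to_chunk(args):
    """Run func over one chunk; top-level so worker processes can import it."""
    func, chunk = args
    return [func(item) for item in chunk]


def parallel_apply(items, func, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Apply func to every item using a pool of workers, preserving order."""
    chunks = split_into_chunks(list(items), n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        results = pool.map(_apply_to_chunk, [(func, c) for c in chunks])
        # Consume the results while the pool is still open.
        return [out for chunk_result in results for out in chunk_result]
```

With pandas, a call might look like `df['keywords'] = parallel_apply(df['content'].tolist(), text_processing)`, run under an `if __name__ == "__main__":` guard, with `text_processing` defined at module level so that worker processes can pickle it.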
  • 33. Many handy text processing libraries TextBlob as an example Tokenization Sentence correction .correct() Part of speech .tags Sentiment analysis .sentiment.polarity NLP Library (TextBlob) (spaCy)
  • 34. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content - but how? Sample ~50,000 tickets for each training in each issue category LDA: - Unsupervised learning - Bag of words - “topic distribution” Usage: lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4, random_state=some_number) lda.show_topics()
  • 35. Latent Dirichlet Allocation model General concept of this model Unsupervised learning method - does not require any class labels; similar to clustering ‘Bag of words’ model - uses word counts in messages without regard for word order (“Peter owes Alice money” = “Alice owes Peter money”) Estimated iteratively - starts with random initialization, then adjusts probabilities to reduce perplexity / increase fit (EM; Expectation Maximization) Doc 1 Doc 2 Doc 3 Doc n... (topic) Fruits document-topic probabilities 30% health (topic 1) 60% fruits (topic 2) 10% disease (topic 3)
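The bag-of-words point is easy to verify directly: once a message is reduced to word counts, word order is gone. A small sketch using only the standard library (the helper name `bag_of_words` is an assumption):

```python
from collections import Counter


def bag_of_words(message):
    """Count word occurrences, ignoring word order."""
    return Counter(message.lower().split())


a = bag_of_words("Peter owes Alice money")
b = bag_of_words("Alice owes Peter money")
print(a == b)  # the two messages are indistinguishable as bags of words
```

This is why a topic model built on word counts can group tickets by vocabulary but cannot tell who owes whom.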
  • 36. Latent Dirichlet Allocation model Model implementation and visualization Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Sample ~50,000 tickets for each training in each issue category Usage: lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4, random_state=some_number) lda.show_topics() from pyLDAvis.gensim import prepare, save_html from gensim.models import LdaModel
  • 37. Future work and learnings Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Customization is needed ● Not suited for specific issue category ● Build own dictionary for the removal of irrelevant words Data input: ticket text as raw data Output: topic model clusters How to make the results more useful and actionable? ● Number of topics for convergence ● Time and performance tradeoff ● Other “Deep NLP” models? Word2vec GloVe fastText
  • 38. Table of contents Self-introduction: Who am I? What does our analytics team do for Asia-Pacific? Project #1: Text mining tool to unlock user insights. Python libs: natural language processing, topic modeling. Project #2: Deep learning-based answering bot for call center. Python libs: machine learning related, such as tensorflow, keras, sklearn, numpy, etc.
  • 39. Product owner: Huaixiu Zheng and Yichia Wang, Hugh Williams in Uber’s Applied Machine Learning team Project #2: Artificial Intelligence revolution in call centers
  • 40. CSR’s sample workflow to respond to a user in a call center How do our users submit an issue?
  • 41. CSR’s sample workflow to respond to a user in a call center Online support via in-app help
  • 42. The issue for call center operation: scalability and cost The growth comes at a price again….
  • 43. Solution? Let’s start from a basic sample “I want to change my rating for a rider” - very rule-based deterministic flow
  • 44. The business impact of a simple bot-solving solution 3k+ weekly solves A team of 18 CSRs 28k USD monthly
  • 45. What’s the problem with this solution? “Scalability”
  • 46. The difference between Programming and Machine Learning
  • 47. Our machine learning solution design Why go with “Semi-automated” assistance rather than real robot? Product designed by Hugh Williams, Huaixiu Zheng, Yi-Chia Wang in Applied Machine Learning team
  • 48. Our machine learning solution design ‘Assistant to CSR’ - Provide suggestions for reply and actions Issue category/ type suggestion Action suggestion 10M+ tickets Correct response from agents to these 10M+ tickets Technical model training Product design
  • 49. Typical Machine Learning process Note: picture from Mark Peng’s “General Tips for participating Kaggle Competitions” on SlideShare
  • 50. Typical Machine Learning process Model selection ML 101: Start with a simple model first Data source: https://eng.uber.com/cota-v2/
  • 51. Deep Learning Architecture Reference: Uber AML Lab: http://eng.uber.com/cota 1000+ multiclass problem 10M English tickets (10-day)
  • 52. Deep Learning Architecture Reference: Uber AML Lab: http://eng.uber.com/cota Sample code with Keras for a simple CNN
  • 53. Deep Learning Architecture Reference: Uber AML Lab: http://eng.uber.com/cota arXiv Paper: COTA: Improving the Speed and Accuracy of Customer Support through Ranking and Deep Networks CNN: Max pooling Optimizers: Adam (SGD, RMSProp), Batch Normalization Regularization: L2 Reg, Dropout, Batch Normalization, early stopping
  • 54. Development environment for Deep learning model training What does model training look like? >> DEMO Main codebase + data set GRID K520
  • 55. Feature engineering and feature importance Trade off between capacity and interpretability “Capacity” “Interpretability”
  • 56. Feature engineering and feature importance What are the important features? Very easy to learn that in a simpler model
  • 57. Feature engineering and feature importance What are the important features? Very easy to get explanation in simpler models
  • 58. Feature engineering and feature importance What are the important features? NN model is like our brain’s intuition … blackbox
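  • 59. Feature engineering process What are the important features? Trick: the dataset is too large and retraining the model takes a long time → shuffle the features in the “test data” one at a time to get results quickly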
  • 60. Feature engineering and feature importance What are the important features? Sklearn: Recursive feature elimination (sklearn.feature_selection.RFE) Mockup dataset
  • 61. Feature engineering and feature importance What are the important features? Model training takes far longer than prediction (training time >>> prediction time) Shuffle each feature to create noise on the testing set Mockup dataset
  • 62. Feature engineering and feature importance What are the important features? Shuffle each feature to create noise…. on the testing set Mockup example
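The shuffling trick from the last few slides (often called permutation importance) can be sketched with NumPy alone. This is an illustration on mock-up data, not the talk's code: the least-squares "model", the `neg_mse` metric, and all names below are assumptions. The key property is that the model is trained once and only cheap predictions are repeated.

```python
import numpy as np


def permutation_importance(predict, X_test, y_test, metric, seed=0):
    """Score drop when each column of X_test is shuffled, one at a time."""
    rng = np.random.default_rng(seed)
    baseline = metric(y_test, predict(X_test))
    drops = []
    for j in range(X_test.shape[1]):
        X_shuffled = X_test.copy()
        # Shuffling breaks the link between feature j and the target
        # while keeping the feature's own distribution intact.
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
        drops.append(baseline - metric(y_test, predict(X_shuffled)))
    return np.array(drops)


# Mock-up data: the target depends on feature 0 only; feature 1 is noise.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + data_rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

# Stand-in for an already trained model: ordinary least squares.
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def predict(features):
    return features @ weights


def neg_mse(y_true, y_pred):
    """Negative mean squared error, so that higher is better."""
    return -np.mean((y_true - y_pred) ** 2)


importance = permutation_importance(predict, X_test, y_test, neg_mse)
print(importance)  # feature 0 should show a far larger drop than feature 1
```

The same function works with any fitted model that exposes a prediction callable, including a neural network, which is what makes the approach attractive when retraining is expensive.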
  • 63. Why is NumPy faster? Python Vectorization: Single Instruction, Multiple Data (SIMD)
  • 64. Why is NumPy faster? Python Vectorization: Single Instruction, Multiple Data (SIMD)
  • 65. Why is NumPy faster? Python Vectorization: Locality of reference (Spatial Locality) Java/ C++ versus Python…
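A small comparison illustrates the vectorization argument. The loop version runs one interpreted iteration per element, while the vectorized version hands the whole typed, contiguous buffer to a compiled routine in a single call. Whether SIMD instructions are actually used depends on the NumPy build and the hardware, so no timings are claimed here; the function names are assumptions.

```python
import numpy as np


def dot_loop(a, b):
    """Pure-Python dot product: one interpreted iteration per element."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_vectorized(a, b):
    """Same computation handed to NumPy's compiled routine in one call."""
    return float(np.dot(a, b))


a = np.arange(1_000, dtype=np.float64)
b = np.ones(1_000, dtype=np.float64)

print(dot_loop(a, b) == dot_vectorized(a, b))  # both compute the same sum
print(a.flags["C_CONTIGUOUS"])  # elements sit side by side in memory
```

A Python list, by contrast, stores references to separately allocated objects, so iterating over it has much weaker spatial locality than walking a NumPy array's single buffer.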
  • 66. Issue category suggestion Action suggestion Product design Last stop: making business Impact Ensure KPI measurement is well-planned in the beginning
  • 67. Last stop: making business Impact Identify key business metrics, and cautiously conduct and monitor experiments Source: https://eng.uber.com/cota-v2/ Experiment notes: * Network effect → Switchback test instead of A/B test * Guardrail variable and decision variable (risk control) * Monitoring versus peeking * Novelty effect
  • 68. Other learnings How to become a better programmer, or data scientist?
  • 69. Other learnings How to become a better programmer, or data scientist? ● Long-term growth: Don’t just know how to call APIs → ○ Understand what’s happening underneath (math and low-level manipulation are key) ○ Understand the pros and cons of your tool/ model/ framework choice ● Coding at scale: Resources and infra are rich, but the data is also huge (as is the risk) → optimize time and space, but don’t overdesign ● Communication: Everybody is busy → organize and communicate your work well, and build good social relationships
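  • 70. Recommended reading How to become a better programmer, or data scientist? Read more, code more, share more Data Science from Scratch (Chinese edition: 用python學資料科學): highly recommended; you can also try the original English edition Python資料運算與分析實戰 (hands-on Python data computation and analysis with NumPy, SciPy, pandas): programming books by Japanese authors are also excellent... 流暢的Python (Fluent Python): my recommended bible for Java is Effective Java; this one may not be quite at that level yet, but I still recommend it!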
  • 71. Recommended reading How to become a better programmer, or data scientist? Read & Code & Share, and repeat Machine Learning and Deep Learning with Python Focus on scikit-learn and TensorFlow Data Science from Scratch Highly recommend: Python-based hand-in-hand On classical concepts and algorithms
  • 72. Paul Lo Data Analytics @ Uber paul.lo *a*t uber.com | paullo0106 a*t* gmail.com Q&A