[Taipei.py] improving user experience with text mining and deep learning in Uber
1. Paul Lo, 2018/12 @ Taipei.py
Data Analytics @ Uber, Asia-Pacific Community Operation Central team
paullo0106@gmail.com | http://paullo.myvnc.com/blog/
Improving User Experience with Text Mining
and Deep Learning in Uber
2. Project #1
Text mining tool to unlock user insights
Python lib: natural language processing,
topic modeling
Self-introduction
Who am I?
What does our analytics team do for
Asia-Pacific?
Project #2
Deep learning-based answering bot
for call center
Python libs: machine learning libraries
such as tensorflow, keras, sklearn,
numpy, etc.
Improving User Experience with Text Mining and Deep Learning in Uber
Table of contents
4. Scope of Community Operation in Uber APAC
Scope
10+ languages in ~20 countries
Central Team
based in
Manila,
Singapore,
India
India
Singapore (South East and North Asia)
Australia
5. Data @ Uber
Uber’s Data Lake
Stores 30+ Petabytes of data
~M clusters across N data centers
(thousands of servers)
So how much data is that really?
~100,000 years of music
Which is 50x the amount of music streamed on Spotify
each year
50+ billion books or 50 million kindles
Equivalent to the entire written works of mankind from
the beginning of recorded history, in all languages
150+ years of 24/7 Full HD video recording
The amount of storage required to render 50 Avatar
movies, simultaneously
How big is Big data?
6. Data-driven business decision culture
Data helps us tell our story to the public and operate better
Typical policy and communications questions:
● How many jobs does Uber provide in Taipei?
● How is Uber pool reducing congestion in Manila?
● What proportion of our trips start or end at public transportation?
** Uber's open city traffic data: https://movement.uber.com
Typical city operation questions:
● Do we have enough drivers for the New Year?
● How can we reduce the ETA for our riders?
● When is the best time to introduce an Eats delivery fee in my city?
7. Data tools to support Big data
Source: https://eng.uber.com
8. What is our role at Uber?
Uber’s Data Lake
App + Support Data:
Rides, Eats, etc.
Payments Data:
Collection, Payments
External Data:
Traffic, Weather,
Holidays, Maps
Machine learning
platform
Programming
interface
Query interface
Internal BI Tools
Company-wide
dashboards
Marketing Data:
Clicks, Impressions,
Sentiment
9. Improving user experience is one of our core missions
Improve user experience
Drive down defect rate
Optimize operational efficiency
Manage the cost of business operation
10. Project #1:
Text mining and NLP for user experience
enhancement
Acknowledgement: Troy James Palanca, Lorenzo Ampil
11. Value proposition
Speed up the workflow on user experience enhancement
Defect rate and issue type
Leaderboard
Community
Operation
Product,
Engineering,
etc.
User
feedback
database
Root cause analysis
and recommended
feature or policy
changes
Review
customer
feedback in
tickets
User experience
enhancement
12. Value proposition
Making this process more efficient
13. Problem
How can we quickly get the insights from users’ feedback?
Problem
Reviewing tickets
manually to diagnose
the root cause is
neither scalable nor
systematic
Ticket dataset
Driver > Trips > Fare … > … > Technical issue
14. Problem
How can we quickly get the insights from users’ feedback?
Solution
Use topic modeling
techniques to
efficiently group tickets
and assign them to
reasonably named
topics.
Ticket dataset
Driver > Trips > Fare … > … > Technical issue
App stuck/ crash
(35%)
Fare calculation
Dispute
(15%)
GPS issue
(55%)
15. Key features of our solution
Using a topic-modeling-based tool to learn our users' pain points
Ticket snippet with user profile: respective ticket
samples are displayed when clicking on a keyword
Word cloud view: users can switch to
this view to see the most relevant keywords
(by tf-idf score) in each topic
>>DEMO
16. Sample results
“Fare Disputes” in one of the cities where we operate are
mainly about payments, airport issues, and wrong
riders:
● Credit cards and other modes of payment
(18%)
● Overcharging (28.8%)
● Wrong profiles being billed (12.8%)
● Airport terminal issues (12.9%)
● Someone else taking the trip (12.5%)
17. Sample results
Lots of “rude”, “loud music”, “drunk”, and “slam door” keywords
were detected as the pain points of our NY driver partners
18. Sample results
More than 10% of driver cancellation
tickets in Singapore are related to car
seat rules for child safety: many
sample tickets show that drivers want
their cancellation fee reimbursed because
their riders brought children without prior
notice.
19. Tool architecture
Computing node
(any Uber servers)
Data collection
Data preparation
LDA model training
Web server
(AWS node)
HTML and JSON
files from
training results
User Interface
(d3js)
Train a model monthly for each country's
top issues
20. Workflow overview
Data input: ticket text as raw
data
Output: topic model clusters
Unlocking support insights from textual content
Sample ~50,000 tickets for
each training in each issue
category
21. Workflow overview
Data Preparation
(text processing)
Extract useful information and
transform corpus to a sparse
matrix
Data Modeling
(Latent Dirichlet
Allocation)
Main computation to perform
topic modeling
Text processing libraries: nltk, BeautifulSoup, re, TextBlob
LDA library: gensim.ldamodel.LdaModel and pyLDAvis
22. Workflow overview
Remove invalid words:
● Numbers
● Html tags
● Custom dictionary
Stemming and lemmatization
Tokenization
TFIDF (Term Frequency Inverse Document
Frequency)
23. Workflow overview
Remove invalid words:
● Numbers
re.sub(r'\d+', '', text)
● HTML tags
BeautifulSoup(document, 'html.parser').get_text()
BeautifulSoup(document, 'html.parser').find_all('b')
● Custom dictionary
Stemming and lemmatization
Tokenization
TFIDF (Term Frequency Inverse Document
Frequency)
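The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the team's actual pipeline: it swaps in Python's built-in html.parser for BeautifulSoup so it runs without extra packages, and the names clean_ticket, strip_html, and CUSTOM_STOPWORDS are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical custom dictionary of words to drop; a real one would be
# built per issue category.
CUSTOM_STOPWORDS = {"uber", "hi", "thanks", "please"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(document):
    """Return the text of an HTML fragment with all tags removed."""
    parser = _TextExtractor()
    parser.feed(document)
    parser.close()
    return " ".join(parser.parts)


def clean_ticket(document):
    """Strip tags and numbers, lowercase, and drop custom stop words."""
    text = strip_html(document)
    text = re.sub(r"\d+", " ", text)  # remove numbers
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in CUSTOM_STOPWORDS]


print(clean_ticket("<p>Hi, I was charged <b>2</b> times for trip 12345!</p>"))
```

Stemming, lemmatization, and part-of-speech filtering would follow this step; they are left out here because they depend on an NLP library such as nltk or TextBlob.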
24. Workflow overview
Remove invalid words
Stemming and lemmatization: Reduce inflectional
forms and sometimes derivationally related forms of a
word to a common base form. For instance:
○ cancel, cancels, cancelled -> cancel
○ riders, rider -> rider
Tokenization
TFIDF (Term Frequency Inverse Document
Frequency)
25. Workflow overview
Remove invalid words
Stemming and lemmatization
Tokenization: Part-of-speech based word
detection
TFIDF (Term Frequency Inverse Document
Frequency)
26. Workflow overview
Remove invalid words
Stemming and lemmatization
Tokenization: Part-of-speech based word
detection
TFIDF (Term Frequency Inverse Document
Frequency): a common practice for scoring each term
by weighted frequency and relevance
27. Data Preparation (Natural Language Processing)
Using TFIDF to filter the most important keywords
Machine Learning
Model
28. Data Preparation (Natural Language Processing)
Using TFIDF to filter the most important keywords
Machine Learning
Model
Term frequency
Inverse Document
Frequency
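To make the two factors concrete, here is a small sketch of TF-IDF scoring in plain Python. It assumes the common textbook weighting (term count divided by document length, times the log of total documents over documents containing the term); libraries such as gensim and scikit-learn apply their own smoothing and normalization, so their numbers will differ. The function name tfidf and the toy tickets are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Score each term in each tokenized document.

    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = log(N / number of documents containing t)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    ["driver", "cancel", "trip", "fee"],
    ["rider", "cancel", "trip"],
    ["app", "crash", "trip"],
]
for doc_scores in tfidf(docs):
    # Print terms from most to least distinctive within each ticket.
    print(sorted(doc_scores, key=doc_scores.get, reverse=True))
```

A word such as "trip" that appears in every ticket gets an inverse document frequency of log(1) = 0, so it drops out as a keyword no matter how often it is used.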
29. Workflow overview
Data preparation can be very time-consuming
Remove invalid words:
Stemming and lemmatization
Tokenization
TFIDF (Term Frequency Inverse Document
Frequency)
30. Speed up data processing
Pandas runs on a single thread by default
A pandas DataFrame with 50k+ rows
Data Preparation
text_processing() is a heavy function
that does many things:
● Tokenization
● Removal of numbers, html tags, and
other invalid words
● Stemming and lemmatization
● TFIDF
df['content'].apply(text_processing)
→ single thread by default
31. Speed up data processing
Pandas runs on a single thread by default
Worker 1
Worker 2
Worker N
keywords
32. Data processing speedup trick in Pandas
Pandas runs on a single thread by default
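One common way to spread a heavy apply() across workers is to split the column into chunks and hand each chunk to a process pool. The sketch below assumes that approach; it is not necessarily the exact trick shown in the talk. The text_processing body is a stand-in, and parallel_apply and _apply_chunk are hypothetical helper names.

```python
from multiprocessing import Pool

import numpy as np
import pandas as pd


def text_processing(text):
    """Stand-in for the heavy per-ticket function described above."""
    return text.lower().split()


def _apply_chunk(chunk):
    # Runs inside a worker process on one slice of the Series.
    return chunk.apply(text_processing)


def parallel_apply(series, n_workers=4):
    """Split a Series into chunks and process them in worker processes."""
    if n_workers <= 1 or len(series) < n_workers:
        # Not worth the process overhead: fall back to a plain apply.
        return series.apply(text_processing)
    # Positional split so each worker gets a contiguous slice.
    bounds = np.linspace(0, len(series), n_workers + 1, dtype=int)
    chunks = [series.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
    with Pool(n_workers) as pool:
        results = pool.map(_apply_chunk, chunks)
    # Concatenating preserves the original index and order.
    return pd.concat(results)
```

Typical usage would be df['keywords'] = parallel_apply(df['content'], n_workers=8), called from under an if __name__ == "__main__": guard on platforms that spawn rather than fork worker processes. The speedup depends on how expensive text_processing is relative to the cost of copying chunks between processes.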
33. Many handy text processing libraries
TextBlob as an example
Tokenization
Sentence correction: .correct()
Part of speech: .tags
Sentiment analysis: .sentiment.polarity
NLP Library
(TextBlob)
(spaCy)
34. Workflow overview
Unlocking support insights from textual content - but how?
LDA:
- Unsupervised learning
- Bag of words
- “topic distribution”
Usage:
lda = LdaModel(corpus=corpus,
id2word=dictionary,
num_topics=4,
random_state=some_number)
lda.show_topics()
35. Latent Dirichlet Allocation model
General concept of this model
Unsupervised learning method - does not
require any class labels; similar to clustering
‘Bag of words’ model - uses word counts in
messages without regard for word order
(Peter owes Alice money = Alice owes Peter
money)
Estimated iteratively - Starts with random
initialization then adjusts probabilities to
reduce perplexity / increase fit
(EM; Expectation Maximization)
Doc 1, Doc 2, Doc 3, ... Doc n
Document-topic probabilities (example):
30% health (topic 1)
60% fruits (topic 2)
10% disease (topic 3)
36. Latent Dirichlet Allocation model
Model implementation and visualization
Usage:
lda = LdaModel(corpus=corpus,
id2word=dictionary,
num_topics=4,
random_state=some_number)
lda.show_topics()
from pyLDAvis.gensim import prepare, save_html
from gensim.models import LdaModel
37. Future work and learnings
Customization is needed
● Not suited for
specific issue
category
● Build own
dictionary for the
removal of
irrelevant words
How to make the results more useful and actionable?
● Number of topics for convergence
● Time and performance
tradeoff
● Other "Deep NLP" models?
Word2vec
GloVe
Fasttext
39. Project #2:
Artificial Intelligence revolution in call centers
Product owners: Huaixiu Zheng, Yichia Wang, and Hugh Williams in Uber’s Applied Machine Learning team
40. A CSR’s sample workflow for responding to a user in a call center
How do our users submit an issue?
41. A CSR’s sample workflow for responding to a user in a call center
Online support via in-app-help
42. The issue for call center operation: scalability and cost
The growth comes at a price again….
43. Solution? Let’s start with a basic example
“I want to change my rating for a rider” - a rule-based, deterministic flow
44. The business impact of a simple bot-solving solution
3k+ weekly solves
A team of
18 CSRs
28k USD
monthly
47. Our machine learning solution design
Why go with “semi-automated” assistance rather than a real robot?
Product designed by Hugh Williams, Huaixiu Zheng, Yi-Chia Wang in Applied Machine Learning team
48. Our machine learning solution design
‘Assistant to CSR’ - Provide suggestions for reply and actions
Issue category/ type suggestion
Action suggestion
10M+ tickets
Correct response from
agents to these 10M+
tickets
Technical model training
Product design
49. Typical Machine Learning process
Note: picture from Mark Peng’s “General Tips for participating Kaggle Competitions” on Slideshare
50. Typical Machine Learning process
Model selection
ML 101:
Start with a simple model first
Data source: https://eng.uber.com/cota-v2/
53. Deep Learning Architecture
Reference: Uber AML Lab: http://eng.uber.com/cota
arXiv Paper: COTA: Improving the Speed and Accuracy of Customer Support through Ranking and Deep Networks
CNN: Max pooling
Optimizers: Adam (SGD, RMSProp), Batch Normalization
Regularization: L2 Reg, Dropout, Batch Normalization, early stopping
54. Development environment for Deep learning model training
What does model training look like?
>> DEMO
Main codebase + data set
GRID K520
55. Feature engineering and feature importance
Trade off between capacity and interpretability
“Capacity”
“Interpretability”
56. Feature engineering and feature importance
What are the important features? Very easy to learn in simpler models
57. Feature engineering and feature importance
What are the important features? Very easy to get explanation in simpler models
58. Feature engineering and feature importance
What are the important features? An NN model is like our brain’s intuition ... a black box
60. Feature engineering and feature importance
What are the important features?
Sklearn: Recursive feature elimination
(sklearn.feature_selection.RFE)
Mockup
dataset
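A minimal sketch of recursive feature elimination with scikit-learn follows. The mock dataset comes from make_classification, and the choice of logistic regression as the estimator and of three features to keep are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Mock dataset: 10 features, only 3 of which carry signal.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# RFE refits the estimator repeatedly, dropping the weakest feature
# (judged by the estimator's coefficients) each round until 3 remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)

print("kept features:", selector.support_.nonzero()[0])
print("ranking (1 = kept):", selector.ranking_)
```

Because the estimator is refit after every elimination, the cost grows with the number of features and with how slow the model is to train.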
61. Feature engineering and feature importance
What are the important features?
Model training takes far longer than prediction
Shuffle each feature to create noise... on the testing set
Mockup
dataset
62. Feature engineering and feature importance
What are the important features?
Shuffle each feature to create noise…. on the testing set
Mockup
example
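The shuffling idea can be sketched in a few lines of NumPy. Because only predictions are recomputed, the model is trained once, which matters when training is far slower than prediction. The mock data and the least-squares model here are assumptions chosen to keep the example short; the same loop would apply to a trained neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mock dataset: y depends strongly on feature 0, weakly on feature 1,
# and not at all on feature 2.
X = rng.normal(size=(1000, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)
X_train, X_test = X[:800], X[800:]
y_train, y_test = y[:800], y[800:]

# Any trained model works; a least-squares fit keeps the sketch short.
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def mse(features, target):
    """Mean squared error of the fitted model on the given data."""
    return float(np.mean((features @ weights - target) ** 2))


baseline = mse(X_test, y_test)
importance = []
for j in range(X_test.shape[1]):
    shuffled = X_test.copy()
    # Shuffling one column breaks its link to the target while
    # keeping its distribution unchanged.
    shuffled[:, j] = rng.permutation(shuffled[:, j])
    importance.append(mse(shuffled, y_test) - baseline)

print(importance)  # larger increase in error = more important feature
```

Repeating the shuffle several times per feature and averaging the results gives a more stable estimate.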
63. Why is NumPy faster?
Python Vectorization: Single Instruction, Multiple Data (SIMD)
64. Why is NumPy faster?
Python Vectorization: Single Instruction, Multiple Data (SIMD)
65. Why is NumPy faster?
Python Vectorization: Locality of reference (Spatial Locality)
Java/ C++ versus Python…...
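A quick way to see the effect is to time the same dot product written as a Python loop and as a single NumPy call. This is only a sketch: the array size is arbitrary and the timings will vary by machine, so no particular speedup is claimed.

```python
import time

import numpy as np

n = 1_000_000
a = np.random.default_rng(0).random(n)
b = np.random.default_rng(1).random(n)

# Pure-Python loop: one interpreted multiply-add per element.
start = time.perf_counter()
loop_result = 0.0
for x, y in zip(a.tolist(), b.tolist()):
    loop_result += x * y
loop_seconds = time.perf_counter() - start

# Vectorized: a single call into compiled code over contiguous memory.
start = time.perf_counter()
vector_result = float(a @ b)
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s  numpy: {vector_seconds:.4f}s")
```

The NumPy version benefits from both points on these slides: the values sit next to each other in one typed buffer (spatial locality), and the compiled inner loop can use SIMD instructions instead of dispatching each operation through the interpreter.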
66. Issue category suggestion
Action suggestion
Product design
Last stop: making business impact
Ensure KPI measurement is well-planned in the beginning
67. Last stop: making business impact
Identify key business metrics, and cautiously conduct and monitor experiments
Source: https://eng.uber.com/cota-v2/
Experiment notes:
* Network effect → switchback
test instead of A/B test
* Guardrail variable and
decision variable (risk control)
* Monitoring versus peeking
* Novelty effect
69. Other learnings
How to become a better programmer, or data scientist?
● Long-term growth: Don't just know how to call APIs →
○ Understand what’s happening underneath (math and low-level
manipulation are key)
○ Understand the pros and cons of your tool/model/framework
choice
● Coding at scale: Resources and infra are rich, but the data is also
huge (as is the risk) → optimize for time and space,
but do not overdesign
● Communication: Everybody is busy → organize and
communicate your work well, and build good social relationships
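70. Recommended reading
How to become a better programmer, or data scientist? Read more, code more, share more
Data Science from Scratch (Chinese edition: 用python學資料科學)
Highly recommended; you can also try reading the original English edition
Python資料運算與分析實戰 (Python data computation and analysis in practice)
NumPy, SciPy, pandas
Programming books by Japanese authors are also excellent...
流暢的Python (Fluent Python)
For Java, my recommended bible is Effective Java
This one may not quite be at that level, but I recommend it too!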
71. Recommended reading
How to become a better programmer, or data scientist? Read & Code & Share, and repeat
Machine Learning and Deep Learning with Python
Focus on scikit-learn and TensorFlow
Data Science from Scratch
Highly recommended: Python-based and hands-on
Covers classical concepts and algorithms