[Taipei.py] improving user experience with text mining and deep learning in Uber

Paul Lo, 2018/12 @ Taipei.py
Data Analytics @ Uber, Asia-Pacific Community Operation Central team
paullo0106@gmail.com | http://paullo.myvnc.com/blog/
Improving User Experience with Text Mining
and Deep Learning in Uber

Table of contents
Self-introduction
Who am I?
What does our analytics team do for Asia-Pacific?
Project #1
Text mining tool to unlock user insights
Python libs: natural language processing, topic modeling
Project #2
Deep learning-based answering bot for call center
Python libs: machine learning related, such as tensorflow, keras, sklearn, and numpy

Self-introduction
Skills: Full-stack software engineer (Java/Python) → Data Analyst (Python/R, databases, machine learning)

Scope of Community Operation in Uber APAC
Scope: 10+ languages in ~20 countries
Central Team: based in Manila, Singapore, India
Map labels: India, Singapore (South East and North Asia), Australia

Data @ Uber
How big is Big data?
Uber’s Data Lake
● Stores 30+ Petabytes of data
● ~M clusters across N data centers (thousands of servers)
So how much data is that really?
● ~100,000 years of music, which is 50x the amount of music streamed on Spotify each year
● 50+ billion books or 50 million Kindles: equivalent to the entire written works of mankind from the beginning of recorded history, in all languages
● 150+ years of 24/7 Full HD video recording: the amount of storage required to render 50 Avatar movies simultaneously

Data-driven business decision culture
Data helps us tell our story to the public and operate better
Typical policy and communications questions:
● How many jobs does Uber provide in Taipei?
● How is Uber Pool reducing congestion in Manila?
● What proportion of our trips start or end at public transportation?
** Uber's open city traffic data: https://movement.uber.com
Typical city operation questions:
● Do we have enough drivers for the New Year?
● How can we reduce the ETA for our riders?
● When is the best time to introduce an EATS delivery fee in my city?

Data tools to support Big data
Source: https://eng.uber.com

What are our roles at Uber?
Uber’s Data Lake
App + Support Data: Rides, Eats, etc.
Payments Data: Collection, Payments
External Data: Traffic, Weather, Holidays, Maps
Marketing Data: Clicks, Impressions, Sentiment
Machine learning platform
Programming interface
Query interface
Internal BI Tools
Company-wide dashboards

Improving user experience is one of our core missions
Improve user experience
Drive down defect rate
Optimize operational efficiency
Manage the cost of business operation

Project #1:
Text mining and NLP for user experience enhancement
Acknowledgement: Troy James Palanca, Lorenzo Ampil

Value proposition
Speed up the workflow on user experience enhancement
Defect rate and issue type leaderboard
Community Operation
Product, Engineering, etc.
User feedback database
Root cause analysis and recommended feature or policy changes
Review customer feedback in tickets
User experience enhancement

Making this process more efficient

Problem
How can we quickly get the insights from users’ feedback?
Reviewing tickets manually to diagnose the root cause is unsystematic and does not scale
Ticket dataset
Driver > Trips > Fare … > … > Technical issue
[Figure: a pile of individual, ungrouped tickets]

Problem
How can we quickly get the insights from users’ feedback?
Solution
Use topic modeling techniques to efficiently group tickets and assign them to reasonably named topics.
Ticket dataset
Driver > Trips > Fare … > … > Technical issue
● App stuck/crash (35%)
● Fare calculation dispute (15%)
● GPS issue (55%)

Key features of our solution
Using a topic-modeling-based tool to learn our users' pain points
● Ticket snippets with user profile: the respective ticket samples are displayed when clicking on a keyword
● Word cloud view: users can switch to this view to see the most relevant keywords (by tf-idf score) in each topic
>>DEMO

Sample results
“Fare Disputes” in one of the cities where we operate are mainly about payments, airport issues, and wrong riders:
● Credit cards and other modes of payment (18%)
● Overcharging (28.8%)
● Wrong profiles being billed (12.8%)
● Airport terminal issues (12.9%)
● Someone else taking the trip (12.5%)

Sample results
Lots of “rude”, “loud music”, “drunk”, and “slam door” keywords
were detected as the pain points of our NY driver partners

Sample results
More than 10% of driver cancellation tickets in Singapore are related to car seat rules for child safety: many sample tickets show drivers asking for their cancellation fee to be reimbursed because riders brought children without prior notice.

Tool architecture
Computing node (any Uber servers): data collection, data preparation, LDA model training
Web server (AWS node): HTML and JSON files from training results
User Interface (d3js)
Train the model monthly for each country with top issues

Workflow overview
Unlocking support insights from textual content
Data input: ticket text as raw data
Sample ~50,000 tickets for each training in each issue category
Output: topic model clusters

Workflow overview
Data Preparation (text processing): extract useful information and transform the corpus into a sparse matrix
Data Modeling (Latent Dirichlet Allocation): main computation to perform topic modeling
Text processing libraries: nltk, BeautifulSoup, re, TextBlob
LDA libraries: gensim.models.ldamodel.LdaModel and pyLDAvis

Workflow overview
Data Preparation (text processing) steps:
● Remove invalid words: numbers, HTML tags, custom dictionary
● Stemming and lemmatization
● Tokenization
● TFIDF (Term Frequency Inverse Document Frequency)

Workflow overview
Data Preparation (text processing): remove invalid words
● Numbers
re.sub(r'\d+', '', text)
● Html tags
BeautifulSoup(document, 'html.parser').get_text()
BeautifulSoup(document, 'html.parser').find_all('b')
● Custom dictionary
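The removal steps above can be sketched with the standard library alone. This is only an illustration, not the tool's actual code: it swaps in `html.parser` for BeautifulSoup so that it has no third-party dependencies, and `CUSTOM_STOPWORDS`, `strip_html`, and `clean_ticket` are hypothetical names with made-up stopword entries.

```python
import re
from html.parser import HTMLParser

# Hypothetical custom dictionary of words to drop; the real list is not in the slides.
CUSTOM_STOPWORDS = {"uber", "hi", "thanks"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (stand-in for get_text())."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(document):
    parser = _TextExtractor()
    parser.feed(document)
    parser.close()
    return " ".join(parser.parts)


def clean_ticket(document):
    text = strip_html(document)                    # drop HTML tags
    text = re.sub(r"\d+", " ", text)               # drop numbers
    words = re.findall(r"[a-z']+", text.lower())   # keep alphabetic words only
    return [w for w in words if w not in CUSTOM_STOPWORDS]
```

For example, `clean_ticket("<p>Hi, my <b>fare</b> was 25 dollars too high</p>")` keeps only the words that carry meaning for the topic model.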

Workflow overview
Data Preparation (text processing): stemming and lemmatization
Reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:
○ cancel, cancels, cancelled -> cancel
○ riders, rider -> rider
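To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is a toy written for this example, not the Porter stemmer or a lemmatizer; a real pipeline would more likely rely on a library such as nltk, which the talk lists. The function name `naive_stem` and its rules are assumptions.

```python
def naive_stem(word):
    """Toy stemmer: strip one common suffix, then undo a doubled final consonant."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "cancelled" -> "cancell" -> "cancel"
    if len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


for w in ["cancel", "cancels", "cancelled", "riders", "rider"]:
    print(w, "->", naive_stem(w))
```

The point of the toy is only that several surface forms collapse into one token, which shrinks the vocabulary the topic model has to deal with.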

Workflow overview
Data Preparation (text processing): tokenization and TFIDF
Tokenization: part-of-speech based word detection
TFIDF (Term Frequency Inverse Document Frequency): common practice to score each term with weighted frequency and relevance

Data Preparation (Natural Language Processing)
Using TFIDF to filter the most important keywords
Machine Learning
Model

Term frequency
Inverse Document
Frequency
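A minimal sketch of how the two factors combine, assuming the textbook form tf(t, d) × log(N / df(t)); libraries often use smoothed or normalized variants, so the numbers will not match them exactly. The function name `tfidf` and the toy tickets are made up for illustration.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    ["driver", "cancel", "trip"],
    ["driver", "rude"],
    ["driver", "gps", "gps"],
]
scores = tfidf(docs)
print(scores[2])
```

A word such as "driver" that appears in every ticket gets a score of zero, while "gps", frequent in one ticket and absent elsewhere, scores highest there. That is the filtering effect the slide refers to.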

Workflow overview
Data preparation can be very time-consuming

Speed up data processing
Pandas runs on a single thread by default
Data Preparation: a pandas DataFrame with 50k+ rows
text_processing() is a heavy function that does many things:
● Tokenization
● Removal of numbers, html tags, and
other invalid words
● Stemming and lemmatization
● TFIDF
df['content'].apply(text_processing)
→ single thread by default

Speed up data processing
Worker 1, Worker 2, ..., Worker N → keywords

Data processing speedup trick in Pandas
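One way to spread a heavy per-row function across worker processes is sketched below. It is an assumption about the general pattern, not the code shown in the talk: it uses plain lists and the standard library's `multiprocessing.Pool`, and `text_processing` here is a trivial stand-in for the real function. With a DataFrame, the same pattern would split the rows into chunks and reassemble the per-chunk results with `pandas.concat`.

```python
from multiprocessing import Pool


def text_processing(text):
    """Stand-in for the heavy per-ticket function named on the slide."""
    return [w for w in text.lower().split() if w.isalpha()]


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous parts of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Runs inside one worker: apply the heavy function to every row of the chunk."""
    return [text_processing(text) for text in chunk]


def parallel_apply(texts, n_workers=4):
    """Process each chunk in its own worker process and stitch results back in order."""
    chunks = split_into_chunks(texts, n_workers)
    with Pool(n_workers) as pool:
        results = pool.map(process_chunk, chunks)
    return [row for part in results for row in part]
```

Call `parallel_apply(tickets, n_workers=...)` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. The speedup is bounded by the number of cores and by the cost of sending chunks to the workers.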

Many handy text processing libraries
TextBlob as an example
NLP libraries: TextBlob, spaCy
● Tokenization
● Sentence correction: .correct()
● Part of speech: .tags
● Sentiment analysis: .sentiment.polarity

Workflow overview
Unlocking support insights from textual content - but how?
LDA:
- Unsupervised learning
- Bag of words
- “topic distribution”
Usage:
from gensim.models import LdaModel
lda = LdaModel(corpus=corpus,
               id2word=dictionary,
               num_topics=4,
               random_state=some_number)
lda.show_topics()

Latent Dirichlet Allocation model
General concept of this model
Unsupervised learning method - does not require any class labels; similar to clustering
‘Bag of words’ model - uses word counts in messages without regard for their order (Peter owes Alice money = Alice owes Peter money)
Estimated iteratively - starts with random initialization, then adjusts probabilities to reduce perplexity / increase fit (EM; Expectation Maximization)
[Figure: documents Doc 1, Doc 2, Doc 3, ... Doc n, each with document-topic probabilities, e.g. 30% health (topic 1), 60% fruits (topic 2), 10% disease (topic 3)]
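The bag-of-words point can be checked in a few lines. This sketch only uses `collections.Counter`; the helper name `bag_of_words` is made up for the example.

```python
from collections import Counter


def bag_of_words(text):
    """Count words and ignore their order."""
    return Counter(text.lower().split())


a = bag_of_words("Peter owes Alice money")
b = bag_of_words("Alice owes Peter money")
print(a == b)  # True: both sentences have identical word counts
```

Because word order is discarded, the two sentences are indistinguishable to the model, which is the limitation the slide points out.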

Latent Dirichlet Allocation model
Model implementation and visualization
Usage:
from gensim.models import LdaModel
from pyLDAvis.gensim import prepare
from pyLDAvis import save_html
lda = LdaModel(corpus=corpus,
               id2word=dictionary,
               num_topics=4,
               random_state=some_number)
lda.show_topics()

Future work and learnings
How to make the results more useful and actionable?
Data Preparation (text processing): customization is needed
● Not suited for specific issue category
● Build own dictionary for the removal of irrelevant words
Data Modeling (Latent Dirichlet Allocation):
● # of topics for convergence
● Time and performance tradeoff
● Other "Deep NLP" models? Word2vec, GloVe, fastText

Project #2:
Artificial Intelligence revolution in call centers
Product owners: Huaixiu Zheng, Yi-Chia Wang, and Hugh Williams in Uber’s Applied Machine Learning team

A CSR’s sample workflow for responding to users in a call center
How do our users submit an issue?
Online support via in-app help

The issue for call center operation: scalability and cost
The growth comes at a price again….

Solution? Let’s start with a basic example
“I want to change my rating for a rider” - a very rule-based, deterministic flow
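A rule-based flow of this kind might look like the sketch below. The patterns and action names are hypothetical, written only to illustrate the idea; they are not Uber's actual rules.

```python
import re

# Hypothetical rules: (pattern, canned action). The real flows are not in the slides.
RULES = [
    (re.compile(r"\bchange\b.*\brating\b"), "open_rating_change_flow"),
    (re.compile(r"\blost\b.*\bitem\b"), "open_lost_item_flow"),
]


def route_ticket(message):
    """Return the first matching action, or hand the ticket over to a human agent."""
    text = message.lower()
    for pattern, action in RULES:
        if pattern.search(text):
            return action
    return "escalate_to_csr"


print(route_ticket("I want to change my rating for a rider"))
print(route_ticket("Please fix the stars I gave my last passenger"))
```

The second message asks for the same thing in different words and falls through to a human, so every new phrasing or issue type needs another hand-written rule.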

The business impact of a simple bot-solving solution
3k+ weekly solves
A team of 18 CSRs
28k USD monthly

What’s the problem with this solution?
“Scalability”

The difference between Programming and Machine Learning

Our machine learning solution design
Why go with “semi-automated” assistance rather than a real robot?
Product designed by Hugh Williams, Huaixiu Zheng, Yi-Chia Wang in Applied Machine Learning team

Our machine learning solution design
‘Assistant to CSR’ - Provide suggestions for reply and actions
Product design: issue category/type suggestion, action suggestion
Technical model training: 10M+ tickets, and the correct responses from agents to these 10M+ tickets

Typical Machine Learning process
Note: picture from Mark Peng’s “General Tips for participating Kaggle Competitions” on Slideshare

Typical Machine Learning process
Model selection
ML 101:
Start with a simple model first
Data source: https://eng.uber.com/cota-v2/

Deep Learning Architecture
Reference: Uber AML Lab: http://eng.uber.com/cota
1000+ multiclass problem
10M English tickets (10-day)

Sample code with Keras for a simple CNN

arXiv Paper: COTA: Improving the Speed and Accuracy of Customer Support through Ranking and Deep Networks
CNN: Max pooling
Optimizers: Adam (SGD, RMSProp), Batch Normalization
Regularization: L2 Reg, Dropout, Batch Normalization, early stopping

Development environment for Deep learning model training
What does model training look like?
>> DEMO
Main codebase + data set
GRID K520

Feature engineering and feature importance
Trade off between capacity and interpretability
“Capacity”
“Interpretability”

What are the important features? Very easy to learn that in simpler models

What are the important features? Very easy to get an explanation in simpler models

What are the important features? An NN model is like our brain’s intuition ... a black box

Feature engineering process
What are the important features?
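Trick: the dataset is too large and retraining the model takes a long time → shuffle the features in the "test data" one at a time to get results quickly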

Sklearn: Recursive feature elimination
(sklearn.feature_selection.RFE)
Mockup
dataset

Time on model training >>> prediction
Shuffle each feature to create noise…. on the testing set
Mockup
dataset

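The shuffling trick is commonly known as permutation importance. Below is a minimal NumPy sketch on mock-up data; the ordinary least squares model stands in for the expensive neural network, and the function name, data, and coefficients are all assumptions made for illustration.

```python
import numpy as np


def permutation_importance(predict, X_test, y_test, seed=0):
    """Increase in mean squared error when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(X_test) - y_test) ** 2)
    importances = []
    for j in range(X_test.shape[1]):
        X_shuffled = X_test.copy()
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])  # break feature j only
        error = np.mean((predict(X_shuffled) - y_test) ** 2)
        importances.append(error - baseline)
    return np.array(importances)


# Mock-up data: y depends strongly on feature 0, weakly on feature 1, not on feature 2.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=2000)
X_train, X_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]

# Train once (ordinary least squares as a stand-in for the expensive model).
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
scores = permutation_importance(lambda data: data @ weights, X_test, y_test)
print(scores)
```

The model is trained once and only cheap predictions are repeated, which is why the approach suits cases where training takes far longer than prediction.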

Why is NumPy faster?
Python Vectorization: Single Instruction, Multiple Data (SIMD)

Why is NumPy faster?
Python Vectorization: Locality of reference (Spatial Locality)
Java/C++ versus Python...
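A small timing sketch of the difference, assuming a sum of squares as the workload. Exact timings depend on the machine, so none are quoted here.

```python
import time

import numpy as np

values = np.arange(200_000, dtype=np.float64)

# Pure-Python loop: one interpreted step and one boxed float per element.
start = time.perf_counter()
loop_total = 0.0
for v in values:
    loop_total += v * v
loop_seconds = time.perf_counter() - start

# Vectorized: a single call into compiled code over one contiguous buffer.
start = time.perf_counter()
vector_total = float(np.dot(values, values))
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.6f}s")
```

Both versions compute the same value. The vectorized call avoids per-element interpreter overhead and works on contiguous memory, which is where SIMD instructions and spatial locality can help.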

Last stop: making business impact
Ensure KPI measurement is well-planned in the beginning
Product design: issue category suggestion, action suggestion

Last stop: making business Impact
Identify key business metrics, and cautiously conduct and monitor experiments
Source: https://eng.uber.com/cota-v2/
Experiment notes:
* Network effect → switchback test instead of A/B test
* Guardrail variable and decision variable (risk control)
* Monitoring versus peeking
* Novelty effect
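In a switchback test, whole time windows rather than individual users are assigned to treatment or control, so everyone sharing the system at a given moment sees the same variant. The sketch below assumes simple alternation of one-hour windows; real designs often randomize the arm per window, and the function name, window length, and epoch are made up for illustration.

```python
from datetime import datetime, timedelta


def switchback_arm(timestamp, window_minutes=60, epoch=datetime(2018, 1, 1)):
    """Assign a whole time window (not an individual user) to an arm."""
    window_index = int((timestamp - epoch) / timedelta(minutes=window_minutes))
    return "treatment" if window_index % 2 == 0 else "control"


early = switchback_arm(datetime(2018, 12, 1, 0, 5))
late = switchback_arm(datetime(2018, 12, 1, 0, 55))
next_hour = switchback_arm(datetime(2018, 12, 1, 1, 30))
print(early, late, next_hour)
```

Two tickets arriving in the same hour get the same arm, while the next hour flips to the other one.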

Other learnings
How to become a better programmer, or data scientist?
● Long-term growth: don't just know how to call APIs →
○ Understand what’s happening beneath (math and low-level manipulation are key)
○ Understand the pros and cons of your tool/model/framework choice
● Coding at scale: resources and infra are rich, but data is also huge (as well as the risk) → time and space optimization, but not overdesign
● Communication: everybody is busy → organize and communicate your work well, and build good social relationships

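Recommended reading
How to become a better programmer, or data scientist? Read more books, write more code, share more
Data Science from Scratch (Chinese edition: 用python學資料科學)
Highly recommended; you can also try reading the original English edition
Python資料運算與分析實戰 (Python data computation and analysis in practice)
Numpy, Scipy, Pandas
Programming books by Japanese authors are also excellent...
流暢的Python (Fluent Python)
For Java, my recommended bible is Effective Java
This one may not be at that level yet, but I still recommend it!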

Recommended reading
How to become a better programmer, or data scientist? Read & Code & Share, and repeat
Machine Learning and Deep Learning with Python
Focus on scikit-learn and TensorFlow
Data Science from Scratch
Highly recommended: Python-based, hands-on coverage of classical concepts and algorithms

Paul Lo
Data Analytics @ Uber
paul.lo *a*t uber.com | paullo0106 a*t* gmail.com
Q&A
