Deep Learning for Natural Language Processing Using Apache Spark and TensorFlow, with Alexis Roos and Wenhao Liu
1. Deep Learning for Natural Language
Processing Using Apache Spark and
TensorFlow
Alexis Roos – Director, Machine Learning @alexisroos
Wenhao Liu – Senior Data Scientist
Activity Intelligence team
3. This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions
proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other
than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other
financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded
services or technology developments and customer contracts or use of our services.
The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and
services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of
our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our
relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer
deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that
could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for
the most recent fiscal quarter. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our
Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or
at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and
does not intend to update these forward-looking statements.
Statement under the Private Securities Litigation Reform Act of 1995
Forward-Looking Statement
4. Doing Well and Doing Good
#1 World’s Most
Innovative Companies
Best Places to Work
for LGBTQ Equality
#1 The World’s Best
Workplaces
#1 Workplace for
Giving Back
#1 Top 50 Companies
that Care
The World’s Most
Innovative Companies
#1 The Future 50
5. Salesforce Keeps Getting Smarter with Einstein
Guide Marketers
Einstein Engagement Scoring
Einstein Segmentation (pilot)
Einstein Vision for Social
Assist Service Agents
Einstein Bots (pilot)
Einstein Agent (pilot)
Einstein Vision for Field Service (pilot)
Coach Sales Reps
Einstein Forecasting (pilot)
Einstein Lead & Opportunity Scoring
Einstein Activity Capture
Advise Retailers
Einstein Product Recommendations
Einstein Search Dictionaries
Einstein Predictive Sort
Empower Admins & Developers
Einstein Prediction Builder (pilot)
Einstein Vision & Language
Einstein Discovery
Help Community Members
Einstein Answers (pilot)
Community Sentiment (pilot)
Einstein Recommendations
Austin Buchan
CEO, College Forward
7. Enhance the CRM experience using AI and activity
(Diagram: Emails, meetings, tasks, calls, etc. flow into Einstein Activity Capture, which extracts insights such as pricing discussed, executive involved, scheduling requested, angry email, or competition mentioned, and suggests actions surfaced in the AI Inbox, Timelines, and other Salesforce apps.)
Email classification use case
8. What types of emails do Sales users receive?
• Emails from customers
• Scheduling requests, pricing requests, competitor mentioned, etc.
• Emails from coworkers
• Marketing emails
• Newsletters
• Telecom, Spotify, iTunes, Amazon purchases
• etc
9. Scheduling requests
We want to identify scheduling requests from customers. Three example emails:
• "Hi Alexis, Can we get together Thursday afternoon? Best, John"
• "Hello Wenhao, Can you send me that really important document? Thanks, Mark"
• "Welcome to Business review! Your subscription is active. Your next letter will be emailed on May 25th 2018."
10. Before scoring: filtering and parsing
Filtering criteria:
• Right language
• Automated vs. non-automated
• Inbound / outbound
• Within or outside the organization
• etc.
Parsing splits an email into labeled segments, e.g.:
• INTRO: "Hey Alexis,"
• BODY: "Let's meet with Ascander on Friday to discuss the $10,000/year rate. Ascander's phone number is (123) 456-7890."
• SIGNATURE: "Thanks, Noah Bergman, Engineer at Salesforce, (123) 456-7890"
• CONFIDENTIALITY NOTICE: "The contents of this email and any attachments are confidential and are intended solely for addressee…"
• REPLY CHAIN: "From: Alexis alexis@salesforce.com / Date: April 1, 2017 / Subject: Important Document / Noah, how much does your product cost?"
• HEADER INFORMATION ...
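The segmentation step above can be sketched with simple text cues. This is a minimal stdlib-only illustration with hypothetical regex patterns and helper names; the production parser is far more involved.

```python
# Rough sketch of splitting an email into segments before scoring.
# The cue patterns below are illustrative assumptions, not the real rules.
import re

CUES = [
    ("REPLY_CHAIN", re.compile(r"^From: ", re.M)),
    ("CONFIDENTIALITY", re.compile(r"confidential", re.I)),
    ("SIGNATURE", re.compile(r"^(Thanks|Best|Regards),\s*$", re.M)),
]

def segment(email):
    # Find the earliest cue; everything before it is treated as body text.
    cut, label = len(email), "BODY_ONLY"
    for name, pat in CUES:
        m = pat.search(email)
        if m and m.start() < cut:
            cut, label = m.start(), name
    return {"BODY": email[:cut].strip(), "FIRST_CUE": label}

parts = segment("Hey Alexis,\nLet's meet Friday.\nThanks,\nNoah")
```

Only the text up to the first cue would be passed on to the classifier.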
11. "Basic" NLP text classifier
Steps:
• Normalize and tokenize
• Generate n-grams
• Remove stop words
• Compute TF with a minimum-count threshold to bound vocabulary size
• Compute IDF and filter n-grams based on an IDF threshold
Shortcomings:
• Lack of generalization, as the classifier is limited to tokens seen in the training data
• A bag of n-grams doesn't take ordering or sequences into account
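The steps above can be sketched in a few lines of plain Python. This is a toy illustration with assumed helper names, a tiny stop-word set, and a smoothed IDF variant; the real pipeline uses Spark ML stages (Tokenizer, NGram, CountVectorizer, IDF).

```python
# Minimal sketch of the n-gram TF-IDF steps: normalize/tokenize,
# n-grams, stop-word removal, TF threshold, IDF filtering.
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "to", "of", "and", "we", "can"}

def normalize_tokenize(text):
    # Lowercase and split on non-alphanumeric characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def ngrams(tokens, n=2):
    # Unigrams plus bigrams, with stop words removed first.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    grams = list(tokens)
    grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tfidf(corpus, min_tf=1, max_idf=math.inf):
    docs = [Counter(ngrams(normalize_tokenize(d))) for d in corpus]
    df = Counter(g for doc in docs for g in doc)
    n = len(corpus)
    idf = {g: math.log((n + 1) / (df[g] + 1)) for g in df}  # smoothed IDF
    # Keep n-grams that clear the TF threshold and fall under the IDF cap.
    return [{g: tf * idf[g] for g, tf in doc.items()
             if tf >= min_tf and idf[g] <= max_idf} for doc in docs]

vectors = tfidf(["Can we meet Thursday?", "Can we meet Friday morning?"])
```

Note how "meet", present in every document, gets zero weight, which is exactly the IDF behavior the slide relies on.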
12. Word2Vec or GloVe
• Unsupervised learning algorithms for obtaining vector representations of words.
• Training is performed on aggregated global word-word co-occurrence statistics from a corpus.
• Word vectors for individual tokens capture their semantics.
word2VecModel.findSynonyms("cost", 5)
MONEY, price, license, nominal, budget
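Under the hood, findSynonyms ranks the vocabulary by vector similarity to the query word. Here is a toy illustration with made-up 3-dimensional vectors (real Word2Vec/GloVe embeddings are typically 100-300 dimensional) using cosine similarity:

```python
# Toy findSynonyms: rank words by cosine similarity of their vectors.
# The vectors below are invented for illustration only.
import math

vecs = {
    "cost":   [0.9, 0.1, 0.2],
    "price":  [0.85, 0.15, 0.25],
    "budget": [0.7, 0.3, 0.1],
    "banana": [0.0, 0.9, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def find_synonyms(word, k):
    # Score every other word against the query vector, highest first.
    scores = [(w, cosine(vecs[word], v)) for w, v in vecs.items() if w != word]
    return sorted(scores, key=lambda s: -s[1])[:k]

top = find_synonyms("cost", 2)
```

Semantically related words end up close in the vector space, so "price" and "budget" rank ahead of "banana".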
13. High-level Architecture
Our current machine learning pipeline is pure Scala/Spark, which has served us well.
Pipeline: Raw Emails → Filtering → Filtered Emails → Text Preprocessing → Feature Extraction (N-gram, TF/IDF, Word2Vec, LDA) → … or other ML models implemented in Scala/Spark
16. What are Neural Networks: recurrent networks
“I grew up in France… I speak fluent French.”
17. LSTM
• RNNs suffer from vanishing or exploding gradients
• LSTMs can store and use memory across a sequence, controlled through gates and operations
• Designed to be chained into an RNN
https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714
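The gate mechanics can be made concrete with a single LSTM cell step in plain Python. This is a pedagogical sketch with scalar state, input, and hand-picked weights; real cells are vector-valued with learned parameters.

```python
# One LSTM cell step: forget/input/output gates plus a candidate value,
# shown with scalars for readability (illustrative weights, not learned).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    # Each gate sees the current input x and the previous hidden state.
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])    # forget gate
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])    # input gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])    # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])  # candidate
    c = f * c_prev + i * g   # memory cell: keep some old state, add some new
    h = o * math.tanh(c)     # hidden state exposed to the next step
    return h, c

w = {"f": (0.5, 0.5, 0.0), "i": (0.5, 0.5, 0.0),
     "o": (0.5, 0.5, 0.0), "g": (0.5, 0.5, 0.0)}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:   # chain the cell across a short sequence
    h, c = lstm_step(x, h, c, w)
```

Because the cell state c is updated additively (gated by f and i) rather than squashed at every step, gradients survive across long sequences far better than in a vanilla RNN.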
19. High Level Model Architecture
We present a “simple” BiLSTM model for text classification.
(Diagram: tokens x0…x3 feed backward LSTM cells Cb0…Cb3 and forward LSTM cells Cf0…Cf3; the last outputs Ob0 and Of3 go to the sigmoid unit.)
• Tokens are mapped into word embeddings (GloVe
pretrained on Wikipedia)
• The word embedding for each token is fed into both
forward and backward recurrent network with LSTM (Long
Short-Term Memory*) cells
• “Last” output of the forward and backward RNNs are
concatenated and taken as input by the sigmoid unit for
binary classification
* Hochreiter & Schmidhuber 1997
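The classification head described above, concatenating the last forward and backward outputs and applying a sigmoid unit, can be sketched in plain Python. The weights here are invented for illustration; in the real model they are learned.

```python
# Sketch of the BiLSTM classification head: concatenate the last forward
# and last backward outputs, then apply a sigmoid unit for the binary
# decision (hypothetical fixed weights; the real model learns these).
import math

def classify(forward_last, backward_last, weights, bias):
    feats = forward_last + backward_last           # concatenation
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1.0 / (1.0 + math.exp(-z))              # sigmoid probability

p = classify([0.2, -0.1], [0.4, 0.3],
             weights=[1.0, 0.5, -0.5, 2.0], bias=0.1)
```

Using both directions means the summary vector has seen the whole email left-to-right and right-to-left before the decision is made.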
20. Detailed Considerations for the Model
About dropout and regularization
• We applied dropout on recurrent connections* and inputs, as well as L2 regularization on the model parameters.
trainable_vars = tf.trainable_variables()
regularization_loss = tf.reduce_sum(
    [tf.nn.l2_loss(v) for v in trainable_vars])
loss = original_loss + reg_weight * regularization_loss
*Gal & Ghahramani NIPS 2016
21. Detailed Considerations for the Model
About variable sequence lengths
Emails come in different lengths: some are extremely short while others are long.
• One-word email: "Thanks"
• Emails of 800+ words are also commonly seen in business email
tf.nn.dynamic_rnn(
cell=lstm_cell,
inputs=input_data,
sequence_length=seq_len
)
Solution: dynamic_rnn + max length + sequence sampling
• tf.nn.dynamic_rnn (or tf.nn.bidirectional_dynamic_rnn) allows for variable lengths for input sequences
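Before calling dynamic_rnn, batches still need a rectangular shape, so sequences are padded to a maximum length while their true lengths are passed via sequence_length. A stdlib sketch of that preprocessing step (hypothetical helper name and pad id):

```python
# Pad token-id sequences to a fixed max length and record true lengths,
# as required to feed dynamic_rnn's sequence_length argument.
def pad_batch(sequences, max_len, pad_id=0):
    padded, lengths = [], []
    for seq in sequences:
        seq = seq[:max_len]                            # truncate very long emails
        lengths.append(len(seq))
        padded.append(seq + [pad_id] * (max_len - len(seq)))
    return padded, lengths

batch, seq_len = pad_batch([[7], [4, 8, 15, 16, 23, 42]], max_len=4)
```

dynamic_rnn then stops unrolling each example at its true length, so the pad positions never contaminate the "last" output used for classification.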
22. Other Model Architectures Considered
• Single-direction RNN
• Single-direction RNN with GRU
• Single-direction RNN with LSTM
• Average pooling for outputs
• Max pooling for all outputs
• CNN on top of outputs
• …
We "settled" on the current architecture after many experiments and considerations.
24. Our workflow around Spark is entirely in the Scala/Spark stack
• Train a SparkML model in the notebook environment and save it out
• At scoring time, load the pretrained SparkML model (part of a SparkML Pipeline) and call its transform method
Question: Can we use a TF model as if it were a native Scala/Spark function?
Fitting a TensorFlow model into a Spark pipeline
25. Scala/Spark Pipeline + TensorFlow Model
TensorFrames / SparkDL as Interface
• The pipeline (Raw Emails → Filtering → Filtered Emails → Text Preprocessing) produces the encoded input: a [BatchSize × SequenceLength] matrix of token ids, e.g.
[[10 19853 3920 8425 43 … 18646]
[235 489 165638 46562 … 16516]]
• The embedding matrix has shape [VocabularySize × EmbeddingLength], with rows of real-valued vectors, e.g.
[[0.19853 0.3920 0.8646 0.459 … 0.1865]
…
[0.684 0.1894 0.1564 0.9874 … 0.354]]
• tf.nn.embedding_lookup maps the encoded input through the embedding matrix to the input tensor of shape [BatchSize × SequenceLength × EmbeddingLength]
* Shi Yan, Understanding LSTM and its diagrams
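The lookup itself is just row indexing: each token id selects a row of the embedding matrix. A plain-Python sketch with toy sizes (real shapes are [VocabularySize × EmbeddingLength] and [BatchSize × SequenceLength]):

```python
# Plain-Python sketch of tf.nn.embedding_lookup: each token id indexes
# a row of the embedding matrix (toy 3-word vocabulary, 2-d embeddings).
embedding_matrix = [
    [0.1, 0.2],   # id 0
    [0.3, 0.4],   # id 1
    [0.5, 0.6],   # id 2
]

def embedding_lookup(matrix, encoded_batch):
    # [BatchSize x SequenceLength] ids ->
    # [BatchSize x SequenceLength x EmbeddingLength] tensor
    return [[matrix[token_id] for token_id in seq] for seq in encoded_batch]

input_tensor = embedding_lookup(embedding_matrix, [[2, 0], [1, 1]])
```

The resulting 3-D tensor is exactly what the BiLSTM consumes, one embedding vector per token per example.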
26. TensorFrames turns a TensorFlow model into a UDF.
Save –> Load –> Score
Save the model:
%python
graph_def = tfx.strip_and_freeze_until(["input_data", "predicted"], sess.graph, sess=sess)
tf.train.write_graph(graph_def, "/model", "model.pb", False)
Load the model:
%scala
val graph = new com.databricks.sparkdl.python.GraphModelFactory()
  .sqlContext(sqlContext)
  .fetches(asJava(Seq("prediction")))
  .inputs(asJava(Seq("input_data")), asJava(Seq("input_data")))
  .graphFromFile("/model/model.pb")
graph.registerUDF("model")
Score with the model:
%scala
val predictions = inputDataSet.selectExpr("InputData", "model(InputData)")
29. Lessons Learned
• A well-tuned LSTM model can outperform traditional ML approaches
• But data preparation is still needed and key to success
• Spark can play nicely with TensorFlow, using TensorFrames as the interface
• We can work end-to-end in a single notebook, mixing Spark/Scala with TF/Python
• The model outperforms the traditional ML approach and is being productized