This document provides an overview of representation learning techniques used at Red Hat, including word2vec, doc2vec, url2vec, and customer2vec. Word2vec is used to learn word embeddings from text, while doc2vec extends it to learn embeddings for documents. Url2vec and customer2vec apply the same technique to learn embeddings for URLs and customer accounts based on browsing behavior. These embeddings can be used for tasks like search, troubleshooting, and data-driven customer segmentation. Duplicate detection is another application, where title and content embeddings are compared. Representation learning is also explored for baseball players to model player value.
Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017
1. REPRESENTATION LEARNING @ RED HAT
Michael A. Alcorn (malcorn@redhat.com)
Machine Learning Engineer - Information Retrieval
https://sites.google.com/view/michaelaalcorn/
3. Background
Why?
Small amount (zero?) of labeled data for task
Lots of unlabeled data (labeled data for a different task?)
Can we use large amounts of unlabeled data to make better predictions?
Representation learning
Transfer learning
Not the same as traditional unsupervised learning!
Excellent chapter in Goodfellow et al.'s Deep Learning textbook
Article by Bengio et al.
6. word2vec
Analogies
"x is to y as ? is to z": x - y + z = ?
bash - shellshock + heartbleed = openssl
firefox - linux + windows = internet_explorer
openshift - cloud + storage = gluster
rhn_register - rhn + rhsm = subscription-manager
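The analogy arithmetic above can be sketched with plain NumPy: subtract and add word vectors, then take the nearest remaining word by cosine similarity. The embedding values below are made-up toy numbers, not real word2vec output from the talk's model.

```python
import numpy as np

# Toy embeddings (illustrative values only, not trained word2vec vectors).
vecs = {
    "bash":       np.array([0.9, 0.1, 0.0]),
    "shellshock": np.array([0.8, 0.0, 0.3]),
    "heartbleed": np.array([0.1, 0.0, 0.9]),
    "openssl":    np.array([0.2, 0.1, 0.6]),
    "firefox":    np.array([0.0, 0.9, 0.1]),
}

def analogy(x, y, z, vocab):
    """Solve "x is to y as ? is to z" via x - y + z, nearest by cosine."""
    target = vocab[x] - vocab[y] + vocab[z]

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Exclude the query words themselves from the candidates.
    candidates = {w: v for w, v in vocab.items() if w not in (x, y, z)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("bash", "shellshock", "heartbleed", vecs))  # -> openssl
```

Real systems do the same nearest-neighbor search over a full trained vocabulary (e.g., gensim's `most_similar` with `positive`/`negative` word lists).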
7. Naming Colors
Blog post by Janelle Shane mapping RGB values to color names
Results are pretty underwhelming for those in the know
Can word embeddings improve (GitHub)?
8. url2vec
Tasks concerning URLs
Search - returning relevant content
Troubleshooting - recommending related articles
Obvious method - look at text
Alternative/enhanced method - use customer
browsing behavior as additional contextual clues
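The url2vec idea amounts to treating each customer's browsing session as a "sentence" of URLs and training a word2vec-style model on those sequences. A minimal sketch of the data preparation, generating skip-gram (center, context) pairs from sessions; the session data and URL paths are invented for illustration:

```python
# Each session is an ordered list of URLs one customer visited.
# Paths below are made-up stand-ins for Customer Portal URLs.
sessions = [
    ["/errata/a", "/solutions/b", "/articles/c"],
    ["/solutions/b", "/articles/c"],
]

def skipgram_pairs(session, window=1):
    """Emit (center, context) pairs a skip-gram model would train on."""
    pairs = []
    for i, center in enumerate(session):
        lo, hi = max(0, i - window), min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, session[j]))
    return pairs

for s in sessions:
    print(skipgram_pairs(s))
```

From here, any word2vec implementation (e.g., gensim's `Word2Vec` fed the sessions as sentences) would learn URL embeddings in which co-browsed pages land near each other.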
15. Duplicate Detection
There are a number of "duplicate" KCS solutions on the Customer Portal
Muddy search results
How can we identify candidate duplicate documents?
Obvious approach - compare text (e.g., tf-idf)
Bag-of-words loses any structural meaning behind text
Can we learn better representations?
Title is essentially a summary of the solution content
Learn representations of body that are similar to title representations (like the DSSM; my code)
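The "obvious approach" on this slide, scoring candidate duplicates by tf-idf cosine similarity, can be sketched in a few lines. The documents below are invented stand-ins for KCS solutions, and whitespace tokenization is a simplification:

```python
import math
from collections import Counter

# Made-up stand-ins for KCS solution texts.
docs = [
    "bash update fixes shellshock vulnerability",
    "update bash to fix the shellshock vulnerability",
    "configure gluster storage for openshift",
]

def tfidf_vectors(docs):
    """Sparse tf-idf vectors as {term: weight} dicts."""
    tokenized = [d.split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in tokenized
    ]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tfidf_vectors(docs)
# The two shellshock docs score high; the unrelated doc scores 0.
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

This is exactly where the bag-of-words weakness shows up: "fixes" vs. "to fix the" contribute nothing to the match, which motivates the learned-representation approach on the next slide.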
16. Deep Semantic Similarity Model
Jianfeng Gao - "Deep Learning for Web Search and Natural Language Processing"
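The DSSM matching objective pairs a query-side embedding (here, a title) with one relevant document embedding (the body) and several negatives, then maximizes the softmax of smoothed cosine similarities. A minimal sketch of that objective; the embeddings are random stand-ins for the outputs of the two towers, and `gamma` plays the role of the DSSM smoothing factor:

```python
import numpy as np

rng = np.random.default_rng(0)
title = rng.normal(size=8)          # title-tower output (stand-in)
bodies = rng.normal(size=(4, 8))    # bodies[0] is the matching body

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

gamma = 10.0  # smoothing factor on the cosine scores
scores = np.array([gamma * cos(title, b) for b in bodies])
probs = np.exp(scores - scores.max())  # numerically stable softmax
probs /= probs.sum()
loss = -np.log(probs[0])  # training maximizes P(matching body | title)
print(probs, loss)
```

Training the two towers to drive this loss down is what makes body representations land near their title representations, which is the property the duplicate-detection slide relies on.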
17. (batter|pitcher)2vec (GitHub)
Can we learn meaningful representations of MLB players?
Accurate representations could be used to simulate games and inform trades
Find undervalued/overvalued players