4. Search Relevance Feature Types
• static document priors
• query intent class labels
• query entities
• query / doc text similarity
• personalization (p18n)
• clickstream
• (example Solr query which demonstrates all of these omitted because it doesn’t fit on this slide)
5. Agenda: getting down to business
• Personalized Search and the Clickstream
• Deep Learning To Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Objective functions
• Distributed vs Local training
• Query time inference
• Deep Tokenization for Lucene
6. DL4IR: How I learned to stop worrying and love deep neural networks
• Non-reasons:
• Always the best ranking results
• C++/CUDA under the hood => superfast inference
• “default” model works OOTB
• My reasons, as a data engineer:
• Extremely modular, unified framework
• Easily updatable models
• GPU => fewer distributed systems
• Domain Knowledge + Feature Engineering => Naive Vectorization + Network Architecture Engineering
7. DL4IR: Why?
• Extremely modular, unified framework. DL models are:
• dissectible: reusable sub-modules
• composable: inputs to other models
• Easily updatable models
• ok, maybe not “easy”
• (because transfer learning is hard)
• GPU => fewer distributed systems
• GPU=supercomputer, CUDA already written
• Feature Engineering is not repeatable:
• Architecture Engineering is (more or less)
• in DL, features aren’t free, but are learned
8. Agenda: Deep LTR
• Deep Learning to Rank
• Embeddings:
• pre-trained
• from scratch
• fine tuned
• Text encoding
• P18n: userId embeddings
• clickstream: docId embeddings
• Objective functions
• Distributed vs Local training
• Query-time inference
9. Embeddings
• Pre-trained text embeddings:
• GloVe (https://nlp.stanford.edu/projects/glove/)
• NNLM on Google news (https://tfhub.dev/google/nnlm-en-dim128/1)
• fastText (https://fasttext.cc)
• ELMo (https://tfhub.dev/google/elmo/2)
• From scratch
• Many parameters -> lots of training data
• Can be unsupervised first, then treated as above
• Fine-tuned
• Start w/ pre-trained, w/ trainable=False
• Train as usual, but not to convergence
• Re-start training with trainable=True + a lower learning rate (sketch below)
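A minimal sketch of that freeze-then-unfreeze recipe (my illustration, not from the slides; assumes `model` is a Keras model whose first layer is the pre-trained Embedding, and x_train/y_train are your training data):

    from tensorflow.keras import optimizers

    # phase 1: embedding frozen (trainable=False); train the rest, but not to convergence
    model.layers[0].trainable = False
    model.compile(optimizer=optimizers.Adam(1e-3), loss="binary_crossentropy")
    model.fit(x_train, y_train, epochs=2)

    # phase 2: unfreeze and re-start training with a lower learning rate
    model.layers[0].trainable = True
    model.compile(optimizer=optimizers.Adam(1e-5), loss="binary_crossentropy")  # re-compile after flipping trainable
    model.fit(x_train, y_train, epochs=2)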
10. Embeddings: keras code
Given pre-trained embeddings as a numpy array of dense vectors (indexed by token-id), just start building your model like so:
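(The slide's actual code was an image; here is a minimal Keras sketch, assuming `embedding_matrix` is that numpy array, with a toy classifier head on top:)

    import numpy as np
    from tensorflow.keras import layers, models

    embedding_matrix = np.load("pretrained_vectors.npy")   # (vocab_size, embed_dim); row i = vector for token-id i
    vocab_size, embed_dim = embedding_matrix.shape

    model = models.Sequential([
        layers.Embedding(input_dim=vocab_size, output_dim=embed_dim,
                         weights=[embedding_matrix],   # initialize from the pre-trained vectors
                         trainable=False),             # freeze here; flip to True when fine-tuning
        layers.GlobalAveragePooling1D(),
        layers.Dense(1, activation="sigmoid"),
    ])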
After training, the embedding will be saved with your model, and you can also extract it out:
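(Again a sketch, not the original slide code:)

    # after model.fit(...): pull the (possibly fine-tuned) embedding matrix back out
    trained_vectors = model.layers[0].get_weights()[0]   # shape: (vocab_size, embed_dim)
    np.save("trained_embedding.npy", trained_vectors)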
11. Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding:
• chars vs words
• CNNs vs LSTMs
• P18n: userId embeddings
• clickstream: docId embeddings
• Objective functions
• Distributed vs Local training
• Query-time inference
12. Text encoding
• Characters vs Words:
• word embeddings require lots of data
• Millions of parameters => many GB of training data
• needs good tokenization + preprocessing
• (must be the same in the data-science pipeline and at query time!)
• Try char sequences instead!
• sometimes works for “old” ML
• works on small data
• on raw byte streams (no tokenizers; see sketch below)
• not my clever trick (cf. Zhang, Zhao, LeCun ’15)
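A minimal sketch of that byte-level encoding (my illustration; max_len is arbitrary):

    import numpy as np

    def encode_bytes(text, max_len=256):
        # raw UTF-8 bytes, no tokenizer or preprocessing; truncate, then post-pad with 0
        ids = list(text.encode("utf-8"))[:max_len]
        return np.array(ids + [0] * (max_len - len(ids)), dtype=np.int32)

    encode_bytes("lord of the ring")   # -> length-256 array of byte ids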
13. 1d-CNNs vs LSTMs: both operate on sequences
CNN: Convolutional Neural Network: 2d for images, 1d for text
LSTM: Long Short-Term Memory: updates its state as it reads, and can emit a sequence of states (one per position) as input for another LSTM.
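(The slide's diagrams are omitted; a minimal Keras sketch of both encoders over the same char-id input, with illustrative sizes:)

    from tensorflow.keras import Input, layers

    chars = Input(shape=(256,))                           # sequence of 256 char/byte ids
    emb = layers.Embedding(input_dim=257, output_dim=16)(chars)

    # 1d-CNN: slide a width-3 window along the sequence, emitting 64 features per position
    cnn_seq = layers.Conv1D(filters=64, kernel_size=3, padding="same", activation="relu")(emb)

    # LSTM: updates its state as it reads; return_sequences=True emits the state at every
    # position, which can feed a second, stacked LSTM
    lstm_seq = layers.LSTM(64, return_sequences=True)(emb)
    lstm_out = layers.LSTM(64)(lstm_seq)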
14. LSTMs are “better”, but I ♥ CNNs
• LSTMs for text:
• A little harder to understand (boo!)
• (black box)-ish, not much to dissect (yay/boo?)
• Many parameters, needs big data (boo!)
• Not GPU-friendly -> slow to train (boo!)
• Often works OOTB w/ no tuning (yay!)
• Typically SOTA quality after significant tuning (yay!)
• CNNs for text:
• Fairly simple to understand (yay!)
• Easily dissectible (yay!)
• Few parameters, requires less training data (yay!)
• GPU-friendly -> super fast to train (yay!)
• Many many hyperparameters -> hard to tune (boo!)
• Currently not SOTA (boo!) but not far off (yay!)
• Typically requires more code (boo!)
17. p18n features
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• pre-trained RecSys (ALS) model
• from scratch w/ hashing trick
• clickstream: docId embeddings
• objective functions
• Distributed vs Local training
• Query-time inference
18. p18n: pre-trained “embeddings” vs hashing trick
• ALS matrix decomposition from collaborative filtering as a “pre-trained embedding”
• or: just hash UIDs into O(1k) buckets (~4x oversized, to avoid total collisions) and learn an O(1k) x O(100) embedding for them (sketch below)
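A minimal sketch of the hashing-trick path (my illustration; bucket count and dimensions are arbitrary):

    import zlib
    from tensorflow.keras import layers

    NUM_BUCKETS = 4096   # O(1k); oversize vs. the active-user count to avoid total collisions

    def user_bucket(user_id):
        # stable hash of the UID into a small, fixed id space
        return zlib.crc32(user_id.encode("utf-8")) % NUM_BUCKETS

    # learn an O(1k) x O(100) embedding over the hashed ids
    user_embedding = layers.Embedding(input_dim=NUM_BUCKETS, output_dim=128)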
19. Clickstream features
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• clickstream: docId embeddings
• same as for userId!
• can overfit easily
• “memorizing” query/doc history
• (which is sometimes ok…)
• Objective functions
• Distributed vs Local training
• Query-time inference
22. Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• clickstream: docId embeddings
• Objective functions:
• Sentiment
• Text classification
• Text generation
• Identity function
• Ranking
• Distributed vs Local training
• Query-time inference
23. non-classification objectives
• Text generation: Neural Network Language Models (NNLM)
• Predict the next character/word from text
• Identity function: Autoencoder
• Predict the input as output
• Search Ranking: score(query, doc)
• query -click-> doc => score = 1
• query -no-click-> doc => score = 0
• better w/ triplets + “curriculum learning”:
• Start with random “no-click” pairs
• Later, pick docs Solr returns for query
• (but got no clicks!)
• eventually: docs w/ fewer clicks than expected
• (known as “hard negative mining”)
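One way to write the triplet objective above as a loss (a sketch; the margin value and mean reduction are my assumptions):

    import tensorflow as tf

    def triplet_hinge_loss(pos_scores, neg_scores, margin=1.0):
        # push score(query, clicked doc) above score(query, no-click doc) by at least `margin`
        return tf.reduce_mean(tf.maximum(0.0, margin - pos_scores + neg_scores))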
24. Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Distributed vs Local training
• Query-time inference
25. Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Distributed vs Local training
• Query-time inference
• Ideally: minimal pre/post-processing
• beware of finicky tensor mappings!
• JVM: MLeap TF support
29. Agenda
• Personalized Search and the Clickstream
• Deep Learning to Rank
• Deep Tokens for Lucene
• char-CNN internals
• LSH for discretization
• Hierarchical semantic tokenization
30. Deep Tokens
• What does a 1d-CNN consume/emit?
• Consumes a sequence (length n) of k-dim vectors
• Emits a sequence (length n) of f-dim vectors
• (assuming sequences are pre+post-padded)
• If a CNN layer’s windows are w-wide, require:
• w*k*f parameters (plus biases)
• Activations are often ReLU: >= 0 w/lots of 0’s
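A quick Keras sanity check of those shapes and the w*k*f parameter count (n, k, w, f values are arbitrary):

    from tensorflow.keras import layers, models

    n, k, w, f = 256, 16, 3, 64   # sequence length, input dims, window width, filter count
    m = models.Sequential([
        layers.Conv1D(filters=f, kernel_size=w, padding="same", activation="relu",
                      input_shape=(n, k)),
    ])
    m.summary()   # output shape: (n, f); params = w*k*f + f biases = 3*16*64 + 64 = 3136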
31. Deep Tokens: intermediate layers
• 1d-CNN feature-vectors
• Consumes a sequence (length n) of k-dim vectors
• Emits a sequence (length n) of f-dim vectors
• (assuming sequences are pre+post-padded)
• If a CNN layer’s windows are w-wide, require:
• w*k*f parameters (plus biases)
• Activations are often ReLU: >= 0 w/lots of 0’s
• How to get this data?
• activs = [enc.layers[3].output, enc.layers[5].output]
• extractor = Model(inputs=enc.inputs, outputs=activs)
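Usage is then just a forward pass (assuming `encoded_docs` is a batch of padded char-id sequences):

    layer3_acts, layer5_acts = extractor.predict(encoded_docs)   # one activation sequence per chosen layer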
32. 1d-char CNN feature vectors by layer
• layer 0:
• Learns simple features like word suffixes, simple morphology, spacing, etc
• layer 1:
• slightly more features like word roots, articles, pronouns, etc
• layer 2:
• complex features: words + common misspellings, hyphenations/concatenations
• layer n:
• Every time you pool + stride over previous layer, effective window grows by factor of pool_size
33. How deep can a char-CNN go?!?
• “Very Deep Convolutional Networks for Text Classification”, Conneau, Schwenk, Barrault, LeCun; ’17
• very small (3-char) windows, low filter count (64) early on
• “temporal version” of VGG architecture
• 29 layers, input as long as 1k chars
• Trained on 100k-3M docs
• 2.5 days on single GPU
• (I don’t know if this works for ranking)
34. What can we do with these vectors?
• Locality Sensitive Hash to int codes
• dense vector becomes a 16-24 bit int
• text => List[Int] at each layer
• Layer 0: same length as input
• Layer N+1 after k-pooling: len(layer_n.output)/k
• Indexing List[Int] is easy!
• “makes sense” to an inverted index
• Query time
• Query => List[Int] per layer
• search as usual (with sparsity!)
35. LSH in 30 seconds:
• Random projections preserve distances, on account of the Johnson-Lindenstrauss lemma
• Can pick totally random vectors
• Or: random sample of 2K vectors from your dataset, project via pi = vi - vi+1 (sketch below)
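A minimal sketch of the sign-random-projection flavor, packing 16 sign bits into one int code (sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    planes = rng.standard_normal((16, 64))       # 16 random hyperplanes over 64-dim feature vectors

    def lsh_code(vec):
        bits = (planes @ vec) > 0                              # which side of each hyperplane
        return int(bits.astype(int) @ (1 << np.arange(16)))   # pack sign bits into a 16-bit int "token"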
36. Deep Tokens: sample similar char-ngrams
• Trained 7-layer char-CNN ranker on 3M BestBuy ecommerce clicks (from Kaggle)
• 64-256 feature maps
• quasi-“hard” negative mining by taking docs returned by Solr but with no clicks
• Example ngrams similar at layer 3-ish or so:
• similar: “ rin”, “e ri”, “rinf”
• From: “lord of the ring”, “LOTR extended edition dvd”, “lord of the rinfs extended”
• and:
• “0 in”, “0in “, “ nch”, “inch”
• From: “70 inch lcd”, “55 nch tv”, “90in sony tv”
• and:
• “s z 8”, “ zs8 “, “ sz8 ”, “lumix”
• From: “panasonic lumix s z 8”, “lumix zs8”, “panasonic dmc-zs8s”
• longer strings similar at layers 2 levels deeper:
• “10.1inches”, “lnch”, “inchplasma”, “inch”
• Still to do: full measurement of full DL ranking vs. approximate multilayer search on these tokens, while sweeping the hyperparameter space and hashing strategies
37. Deep tokens: challenges
• Stability:
• Once model + LSH family is chosen, this is like “choosing an Analyzer”: changing it requires a full reindex
• Hash functions which are “optimal” for one data set may be bad after indexing much more data
• Similarity on differing scales with same semantics
• i.e. “55in” and “fifty five inch”
• (“shortcut” CNN connections needed?)
• Stop words
• want: no hash bucket (i.e. posting list) at any level to contain > 10% of the corpus
• Noisy tokens at earlier levels (maybe never “index” first 3?)
• More generally
• precision vs. recall tradeoff tuning
38. Related work: Xu et al., CNNs for Text Hashing (IJCAI ’15)
and many more (but none with as fun an acronym)
39. Deep Tokens: TL;DR
• configure a model w/ a deep char-CNN-based ranker w/ a search-relevance loss
• Train it as usual
• Configure a convolutional feature extractor (CFE)
• From documents:
• Extract convolutional activations
• (learned textual features!)
• LSH -> discrete buckets (“abstract tokens”)
• Index these tokens
• At query time, use this CFE for:
• posting-list friendly deeply fuzzy search!
• (because really, you just have a very fancy tokenizer)
• N.B. char-CNN models are small (O(100-300k) params)