4. Search Relevance Feature Types
• static document priors
• query intent class labels
• query entities
• query / doc text similarity
• personalization (p18n)
• clickstream
• (example Solr query which demonstrates all of these omitted because it doesn’t fit on this slide)
5. Agenda: getting down to business
• Personalized Search and the Clickstream
• Deep Learning To Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Objective functions
• Distributed vs Local training
• Query time inference
• Deep Tokenization for Lucene
6. DL4IR: How I learned to stop worrying and love deep neural networks
• Non-reasons:
• Always the best ranking results
• C++/CUDA under the hood => superfast inference
• “default” model works OOTB
• My reasons, as a data engineer:
• Extremely modular, unified framework
• Easily updatable models
• GPU => fewer distributed systems
• Domain Knowledge + Feature Engineering => Naive Vectorization + Network Architecture Engineering
7. DL4IR: Why?
• Extremely modular, unified framework. DL models are:
• dissectible: reusable sub-modules
• composable: inputs to other models
• Easily updatable models
• ok, maybe not “easy”
• (because transfer learning is hard)
• GPU => fewer distributed systems
• GPU=supercomputer, CUDA already written
• Feature Engineering is not repeatable:
• Architecture Engineering is (more or less)
• in DL, features aren’t free, but are learned
8. Agenda: Deep LTR
• Deep Learning to Rank
• Embeddings:
• pre-trained
• from scratch
• fine tuned
• Text encoding
• P18n: userId embeddings
• clickstream: docId embeddings
• Objective functions
• Distributed vs Local training
• Query-time inference
9. Embeddings
• Pre-trained text embeddings:
• GloVe (https://nlp.stanford.edu/projects/glove/)
• NNLM on Google news (https://tfhub.dev/google/nnlm-en-dim128/1)
• fastText (https://fasttext.cc)
• ELMo (https://tfhub.dev/google/elmo/2)
• From scratch
• Many parameters -> lots of training data
• Can be unsupervised first, then treated as above
• Fine-tuned
• Start w/ pre-trained, w/ trainable=False
• Train as usual, but not to convergence
• Re-start training with trainable=True + a lower learning rate (sketch below)
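A minimal sketch of that freeze-then-unfreeze recipe (my illustration, not from the slides; assumes `model` is a Keras model whose first layer is the pre-trained Embedding, and x_train/y_train are your training data):

    from tensorflow.keras import optimizers

    # phase 1: embedding frozen (trainable=False); train the rest, but not to convergence
    model.layers[0].trainable = False
    model.compile(optimizer=optimizers.Adam(1e-3), loss="binary_crossentropy")
    model.fit(x_train, y_train, epochs=2)

    # phase 2: unfreeze and re-start training with a lower learning rate
    model.layers[0].trainable = True
    model.compile(optimizer=optimizers.Adam(1e-5), loss="binary_crossentropy")  # re-compile after flipping trainable
    model.fit(x_train, y_train, epochs=2)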
10. Embeddings: keras code
Given pre-trained embeddings as a numpy array of dense vectors (indexed by token-id), just start building your model like so:
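(The slide's actual code was an image; here is a minimal Keras sketch, assuming `embedding_matrix` is that numpy array, with a toy classifier head on top:)

    import numpy as np
    from tensorflow.keras import layers, models

    embedding_matrix = np.load("pretrained_vectors.npy")   # (vocab_size, embed_dim); row i = vector for token-id i
    vocab_size, embed_dim = embedding_matrix.shape

    model = models.Sequential([
        layers.Embedding(input_dim=vocab_size, output_dim=embed_dim,
                         weights=[embedding_matrix],   # initialize from the pre-trained vectors
                         trainable=False),             # freeze here; flip to True when fine-tuning
        layers.GlobalAveragePooling1D(),
        layers.Dense(1, activation="sigmoid"),
    ])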
After training, the embedding will be saved with your model, and you can also extract it out:
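(Again a sketch, not the original slide code:)

    # after model.fit(...): pull the (possibly fine-tuned) embedding matrix back out
    trained_vectors = model.layers[0].get_weights()[0]   # shape: (vocab_size, embed_dim)
    np.save("trained_embedding.npy", trained_vectors)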
11. Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding:
• chars vs words
• CNNs vs LSTMs
• P18n: userId embeddings
• clickstream: docId embeddings
• Objective functions
• Distributed vs Local training
• Query-time inference
12. Text encoding
• Characters vs Words:
• word embeddings require lots of data
• Millions of parameters => many GB of training data
• needs good tokenization + preprocessing
• (must be the same in the data-science pipeline and at query time!)
• Try char sequences instead!
• sometimes works for “old” ML
• works on small data
• on raw byte streams (no tokenizers; see sketch below)
• not my clever trick (cf. Zhang, Zhao, LeCun ’15)
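A minimal sketch of that byte-level encoding (my illustration; max_len is arbitrary):

    import numpy as np

    def encode_bytes(text, max_len=256):
        # raw UTF-8 bytes, no tokenizer or preprocessing; truncate, then post-pad with 0
        ids = list(text.encode("utf-8"))[:max_len]
        return np.array(ids + [0] * (max_len - len(ids)), dtype=np.int32)

    encode_bytes("lord of the ring")   # -> length-256 array of byte ids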
13. 1d-CNNs vs LSTMs: both operate on sequences
CNN: Convolutional Neural Network: 2d for images, 1d for text
LSTM: Long Short-Term Memory: updates its state as it reads, and can emit a sequence of states (one per position) as input for another LSTM.
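(The slide's diagrams are omitted; a minimal Keras sketch of both encoders over the same char-id input, with illustrative sizes:)

    from tensorflow.keras import Input, layers

    chars = Input(shape=(256,))                           # sequence of 256 char/byte ids
    emb = layers.Embedding(input_dim=257, output_dim=16)(chars)

    # 1d-CNN: slide a width-3 window along the sequence, emitting 64 features per position
    cnn_seq = layers.Conv1D(filters=64, kernel_size=3, padding="same", activation="relu")(emb)

    # LSTM: updates its state as it reads; return_sequences=True emits the state at every
    # position, which can feed a second, stacked LSTM
    lstm_seq = layers.LSTM(64, return_sequences=True)(emb)
    lstm_out = layers.LSTM(64)(lstm_seq)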
14. LSTMs are “better”, but I ♥ CNNs
• LSTMs for text:
• A little harder to understand (boo!)
• (black box)-ish, not much to dissect (yay/boo?)
• Many parameters, needs big data (boo!)
• Not GPU-friendly -> slow to train (boo!)
• Often works OOTB w/ no tuning (yay!)
• Typically SOTA quality after significant tuning (yay!)
• CNNs for text:
• Fairly simple to understand (yay!)
• Easily dissectible (yay!)
• Few parameters, requires less training data (yay!)
• GPU-friendly -> super fast to train (yay!)
• Many many hyperparameters -> hard to tune (boo!)
• Currently not SOTA (boo!) but not far off (yay!)
• Typically requires more code (boo!)
17. p18n features
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• pre-trained RecSys (ALS) model
• from scratch w/ hashing trick
• clickstream: docId embeddings
• objective functions
• Distributed vs Local training
• Query-time inference
18. p18n: pre-trained “embeddings” vs hashing trick
• ALS matrix decomposition from collaborative filtering as a “pre-trained embedding”
• or: just hash UIDs into O(1k) buckets (~4x oversized, to avoid total collisions) and learn an O(1k) x O(100) embedding for them (sketch below)
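A minimal sketch of the hashing-trick path (my illustration; bucket count and dimensions are arbitrary):

    import zlib
    from tensorflow.keras import layers

    NUM_BUCKETS = 4096   # O(1k); oversize vs. the active-user count to avoid total collisions

    def user_bucket(user_id):
        # stable hash of the UID into a small, fixed id space
        return zlib.crc32(user_id.encode("utf-8")) % NUM_BUCKETS

    # learn an O(1k) x O(100) embedding over the hashed ids
    user_embedding = layers.Embedding(input_dim=NUM_BUCKETS, output_dim=128)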
19. Clickstream features
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• clickstream: docId embeddings
• same as for userId!
• can overfit easily
• “memorizing” query/doc history
• (which is sometimes ok…)
• Objective functions
• Distributed vs Local training
• Query-time inference
22. Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• clickstream: docId embeddings
• Objective functions:
• Sentiment
• Text classification
• Text generation
• Identity function
• Ranking
• Distributed vs Local training
• Query-time inference
23. non-classification objectives
• Text generation: Neural Network Language Models (NNLM)
• Predict the next character/word from text
• Identity function: Autoencoder
• Predict the input as output
• Search Ranking: score(query, doc)
• query -click-> doc => score = 1
• query -no-click-> doc => score = 0
• better w/ triplets + “curriculum learning”:
• Start with random “no-click” pairs
• Later, pick docs Solr returns for query
• (but got no clicks!)
• eventually: docs w/ fewer clicks than expected
• (known as “hard negative mining”)
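One way to write the triplet objective above as a loss (a sketch; the margin value and mean reduction are my assumptions):

    import tensorflow as tf

    def triplet_hinge_loss(pos_scores, neg_scores, margin=1.0):
        # push score(query, clicked doc) above score(query, no-click doc) by at least `margin`
        return tf.reduce_mean(tf.maximum(0.0, margin - pos_scores + neg_scores))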
24. Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Distributed vs Local training
• Query-time inference
25. Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Distributed vs Local training
• Query-time inference
• Ideally: minimal pre/post-processing
• beware of finicky tensor mappings!
• JVM: MLeap TF support
29. Agenda
• Personalized Search and the Clickstream
• Deep Learning to Rank
• Deep Tokens for Lucene
• char-CNN internals
• LSH for discretization
• Hierarchical semantic tokenization
30. Deep Tokens
• What does a 1d-CNN consume/emit?
• Consumes a sequence (length n) of k-dim vectors
• Emits a sequence (length n) of f-dim vectors
• (assuming sequences are pre+post-padded)
• If a CNN layer’s windows are w-wide, require:
• w*k*f parameters (plus biases)
• Activations are often ReLU: >= 0 w/lots of 0’s
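A quick Keras sanity check of those shapes and the w*k*f parameter count (n, k, w, f values are arbitrary):

    from tensorflow.keras import layers, models

    n, k, w, f = 256, 16, 3, 64   # sequence length, input dims, window width, filter count
    m = models.Sequential([
        layers.Conv1D(filters=f, kernel_size=w, padding="same", activation="relu",
                      input_shape=(n, k)),
    ])
    m.summary()   # output shape: (n, f); params = w*k*f + f biases = 3*16*64 + 64 = 3136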
31. Deep Tokens: intermediate layers
• 1d-CNN feature-vectors
• Consumes a sequence (length n) of k-dim vectors
• Emits a sequence (length n) of f-dim vectors
• (assuming sequences are pre+post-padded)
• If a CNN layer’s windows are w-wide, require:
• w*k*f parameters (plus biases)
• Activations are often ReLU: >= 0 w/lots of 0’s
• How to get this data?
• activs = [enc.layers[3].output, enc.layers[5].output]
• extractor = Model(inputs=enc.inputs, outputs=activs)
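Usage is then just a forward pass (assuming `encoded_docs` is a batch of padded char-id sequences):

    layer3_acts, layer5_acts = extractor.predict(encoded_docs)   # one activation sequence per chosen layer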
32. 1d-char CNN feature vectors by layer
• layer 0:
• Learns simple features like word suffixes, simple morphology, spacing, etc
• layer 1:
• slightly more features like word roots, articles, pronouns, etc
• layer 2:
• complex features: words + common misspellings, hyphenations/concatenations
• layer n:
• Every time you pool + stride over previous layer, effective window grows by factor of pool_size
33. How deep can a char-CNN go?!?
• “Very Deep Convolutional Networks for Text Classification”, Conneau, Schwenk, Barrault, LeCun; ’17
• very small (3-char) windows, low filter count (64) early on
• “temporal version” of VGG architecture
• 29 layers, input as long as 1k chars
• Trained on 100k-3M docs
• 2.5 days on single GPU
• (I don’t know if this works for ranking)
34. What can we do with these vectors?
• Locality Sensitive Hash to int codes
• dense vector becomes a 16-24 bit int
• text => List[Int] at each layer
• Layer 0: same length as input
• Layer N+1 after k-pooling: len(layer_n.output)/k
• Indexing List[Int] is easy!
• “makes sense” to an inverted index
• Query time
• Query => List[Int] per layer
• search as usual (with sparsity!)
35. LSH in 30 seconds:
• Random projections preserve distances, on account of the Johnson-Lindenstrauss lemma
• Can pick totally random vectors
• Or: random sample of 2K vectors from your dataset, project via pi = vi - vi+1 (sketch below)
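A minimal sketch of the sign-random-projection flavor, packing 16 sign bits into one int code (sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    planes = rng.standard_normal((16, 64))       # 16 random hyperplanes over 64-dim feature vectors

    def lsh_code(vec):
        bits = (planes @ vec) > 0                              # which side of each hyperplane
        return int(bits.astype(int) @ (1 << np.arange(16)))   # pack sign bits into a 16-bit int "token"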
36. Deep Tokens: sample similar char-ngrams
• Trained 7-layer char-CNN ranker on 3M BestBuy ecommerce clicks (from Kaggle)
• 64-256 feature maps
• quasi-“hard” negative mining by taking docs returned by Solr but with no clicks
• Example ngrams similar at layer 3-ish or so:
• similar: “ rin”, “e ri”, “rinf”
• From: “lord of the ring”, “LOTR extended edition dvd”, “lord of the rinfs extended”
• and:
• “0 in”, “0in “, “ nch”, “inch”
• From: “70 inch lcd”, “55 nch tv”, “90in sony tv”
• and:
• “s z 8”, “ zs8 “, “ sz8 ”, “lumix”
• From: “panasonic lumix s z 8”, “lumix zs8”, “panasonic dmc-zs8s”
• longer strings similar at layers 2 levels deeper:
• “10.1inches”, “lnch”, “inchplasma”, “inch”
• Still to do: full measurement of full DL ranking vs. approximate multilayer search on these tokens, while sweeping the hyperparameter space and hashing strategies
37. Deep tokens: challenges
• Stability:
• Once model + LSH family is chosen, this is like “choosing an Analyzer”: changing it requires a full reindex
• Hash functions which are “optimal” for one data set may be bad after indexing much more data
• Similarity on differing scales with same semantics
• i.e. “55in” and “fifty five inch”
• (“shortcut” CNN connections needed?)
• Stop words
• want: no hash bucket (i.e. posting list) at any level to contain > 10% of the corpus
• Noisy tokens at earlier levels (maybe never “index” first 3?)
• More generally
• precision vs. recall tradeoff tuning
38. Related work: Xu et al., CNNs for Text Hashing (IJCAI ’15)
and many more (but none with as fun an acronym)
39. Deep Tokens: TL;DR
• configure a model w/ a deep char-CNN-based ranker w/ a search-relevance loss
• Train it as usual
• Configure a convolutional feature extractor (CFE)
• From documents:
• Extract convolutional activations
• (learned textual features!)
• LSH -> discrete buckets (“abstract tokens”)
• Index these tokens
• At query time, use this CFE for:
• posting-list friendly deeply fuzzy search!
• (because really, you just have a very fancy tokenizer)
• N.B. char-CNN models are small (O(100-300k) params)