Deep Learning for Unified
Personalized Search
Recommendations
(and Fuzzy Tokenization)
Jake Mannix
Chief Data Engineer, Lucidworks
@pbrane | in/jakemannix
#Activate18 #ActivateSearch
$whoami
• Now: Chief Data Engineer, Lucidworks
• Applied ML / relevance / RecSys
• data engineering
• Previously:
• Allen Institute for AI: research pub. semantic search
• Twitter: account search, user interest modeling, RecSys
• LinkedIn: profile search, generic entity-to-entity RecSys
• Prehistory:
• Other software dev.
• Algebraic topology, particle cosmology
Agenda
• Personalized Search and the Clickstream
• Deep Learning To Rank
• Deep Tokenization for Lucene
Search Relevance Feature Types
• static document priors
• query intent class labels
• query entities
• query / doc text similarity
• personalization (p18n)
• clickstream
• (example Solr query which demonstrates all of these omitted
because it doesn’t fit on this slide)
Agenda: getting down to business
• Personalized Search and the Clickstream
• Deep Learning To Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Objective functions
• Distributed vs Local training
• Query time inference
• Deep Tokenization for Lucene
DL4IR: How I learned to stop worrying and
love deep neural networks
• Non-reasons:
• Always the best ranking results
• c++/CUDA under the hood => superfast inference
• “default” model works OOTB
• My reasons, as a data engineer:
• Extremely modular, unified framework
• Easily updatable models
• GPU => fewer distributed systems
• Domain Knowledge + Feature Engineering => Naive Vectorization +
Network Architecture Engineering
DL4IR: Why?
• Extremely modular, unified framework. DL models are:
• dissectible: reusable sub-modules
• composable: inputs to other models
• Easily updatable models
• ok, maybe not “easy”
• (because transfer learning is hard)
• GPU => fewer distributed systems
• GPU=supercomputer, CUDA already written
• Feature Engineering is not repeatable:
• Architecture Engineering is (more or less)
• in DL, features aren’t free, but are learned
Agenda: Deep LTR
• Deep Learning to Rank
• Embeddings:
• pre-trained
• from scratch
• fine tuned
• Text encoding
• P18n: userId embeddings
• clickstream: docId embeddings
• Objective functions
• Distributed vs Local training
• Query-time inference
Embeddings
• Pre-trained text embeddings:
• GloVe (https://nlp.stanford.edu/projects/glove/)
• NNLM on Google news (https://tfhub.dev/google/nnlm-en-dim128/1)
• fastText (https://fasttext.cc)
• ELMo (https://tfhub.dev/google/elmo/2)
• From scratch
• Many parameters -> lots of training data
• Can be unsupervised first, then treated as above
• Fine-tuned
• Start w/ pre-trained, w/ trainable=False
• Train as usual, but not to convergence
• Re-start training with trainable=True + lower learning rate
Embeddings: keras code
Pre-trained embeddings as numpy array of dense vectors (indexed
by token-id), just start building your model like so:
After training, the embedding will be saved with your model, and
you can also extract it out:
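The code from the original slide is not preserved in this transcript. As a rough sketch of the preparation step only: the snippet below assembles a weight matrix whose row i is the vector for token-id i. The names `vocab`, `pretrained`, and `build_embedding_matrix` are hypothetical, and plain Python lists stand in for the numpy array. In Keras, such a matrix would typically be passed as the initial weights of an `Embedding` layer and read back after training with the layer's `get_weights()`.

```python
import random

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return a list of rows where row i is the vector for token-id i.

    vocab: dict token -> token-id (ids assumed dense, 0..len(vocab)-1)
    pretrained: dict token -> list of floats of length `dim`
    Tokens missing from `pretrained` get small random vectors.
    """
    rng = random.Random(seed)
    matrix = [None] * len(vocab)
    for token, token_id in vocab.items():
        vector = pretrained.get(token)
        if vector is None:
            vector = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        matrix[token_id] = list(vector)
    return matrix

# Hypothetical toy data, for illustration only.
vocab = {"<pad>": 0, "lumix": 1, "inch": 2}
pretrained = {"lumix": [0.1, 0.2, 0.3], "inch": [0.4, 0.5, 0.6]}
matrix = build_embedding_matrix(vocab, pretrained, dim=3)
```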
Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding:
• chars vs words
• CNNs vs LSTMs
• P18n: userId embeddings
• clickstream: docId embeddings
• Objective functions
• Distributed vs Local training
• Query-time inference
Text encoding
• Characters vs Words:
• word embeddings require lots of data
• Millions of parameters => many GB of training data
• needs good tokenization + preprocessing
• (same in data sci pipeline / at query time!)
• Try char sequences instead!
• sometimes works for “old” ML
• works on small data
• on raw byte streams (no tokenizers)
• not my clever trick (cf. Zhang, Zhao, LeCun ’15)
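A minimal sketch of the "raw byte streams, no tokenizers" idea: each UTF-8 byte becomes one integer id, padded or truncated to a fixed length. The function name and the choice of id 0 for padding are assumptions made for illustration, not taken from the talk.

```python
def encode_bytes(text, max_len):
    """Map text to a fixed-length list of integer ids, one per UTF-8 byte.

    Id 0 is reserved for padding, so byte value b becomes id b + 1.
    """
    ids = [b + 1 for b in text.encode("utf-8")][:max_len]
    return ids + [0] * (max_len - len(ids))

encoded = encode_bytes("55in tv", 10)
```

The same function can run unchanged in the training pipeline and at query time, which sidesteps the tokenizer-mismatch problem noted above.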
1d-CNNs vs LSTMs: both operate on sequences
CNN: Convolutional Neural Network: 2d for images, 1d for text
LSTM: Long Short-Term Memory: updates state as it reads, can emit
sequence of states at each position as input for another LSTM:
LSTMs are “better”, but I prefer CNNs
• LSTMs for text:
• A little harder to understand (boo!)
• (black box)-ish, not much to dissect (yay/boo?)
• Many parameters, needs big data (boo!)
• Not GPU-friendly -> slow to train (boo!)
• Often works OOTB w/ no tuning (yay!)
• Typically SOTA quality after significant tuning (yay!)
• CNNs for text:
• Fairly simple to understand (yay!)
• Easily dissectible (yay!)
• Few parameters, requires less training data (yay!)
• GPU-friendly -> super fast to train (yay!)
• Many many hyperparameters -> hard to tune (boo!)
• Currently not SOTA (boo!) but aren’t far off (yay!)
• Typically requires more code (boo!)
1D CNN text encoder: keras code
1D CNN text encoder: layer shapes and sizes
p18n features
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• pre-trained RecSys (ALS) model
• from scratch w/ hashing trick
• clickstream: docId embeddings
• objective functions
• Distributed vs Local training
• Query-time inference
p18n: pre-trained “embeddings” vs hashing trick
ALS matrix decomposition as “pre-trained embedding”
from collaborative filtering:
or: just hash UIDs to O(1k) dim (4x: avoid total
collisions) and learn an O(1k) x O(100) embedding for
them
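A rough sketch of the hashing-trick option, under one reading of "4x": each user id is hashed with 4 different seeds, so two users share an embedding only if all 4 buckets collide. That reading, the function names, and the use of MD5 are assumptions. In a real model `table` would be a trainable embedding layer of shape O(1k) x O(100), not a fixed list.

```python
import hashlib

def hashed_user_ids(user_id, num_buckets=1024, num_hashes=4):
    """Return `num_hashes` bucket ids in [0, num_buckets) for one user id."""
    buckets = []
    for seed in range(num_hashes):
        digest = hashlib.md5(f"{seed}:{user_id}".encode("utf-8")).digest()
        buckets.append(int.from_bytes(digest[:8], "big") % num_buckets)
    return buckets

def embed_user(user_id, table, num_hashes=4):
    """Average the rows of `table` selected by the user's hashed bucket ids."""
    rows = [table[b] for b in hashed_user_ids(user_id, len(table), num_hashes)]
    return [sum(col) / len(rows) for col in zip(*rows)]

# Toy stand-in for a learned 1024 x 3 embedding table.
table = [[float(i)] * 3 for i in range(1024)]
user_vector = embed_user("user-42", table)
```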
Clickstream features
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• clickstream: docId embeddings
• same as for userId!
• can overfit easily
• “memorizing” query/doc history
• (which is sometimes ok…)
• Objective functions
• Distributed vs Local training
• Query-time inference
All together now: p18n query/doc CNN ranker
Picture > 1k words
Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• clickstream: docId embeddings
• Objective functions:
• Sentiment
• Text classification
• Text generation
• Identity function
• Ranking
• Distributed vs Local training
• Query-time inference
non-classification objectives
• Text generation: Neural Network Language Models (NNLM)
• Predict the next character/word from text
• Identity function: Autoencoder
• Predict the input as output
• Search Ranking: score(query, doc)
• query -click-> doc => score = 1
• query -no-click-> doc => score = 0
• better w/ triplets + “curriculum learning”:
• Start with random “no-click” pairs
• Later, pick docs Solr returns for query
• (but got no clicks!)
• eventually: docs w/ fewer clicks than expected
• (known as “hard negative mining”)
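The first two curriculum stages can be sketched as a triplet sampler plus a margin loss. All names and data structures here are hypothetical, and the hinge loss is one common choice for triplets rather than necessarily the loss used in the talk. The third stage (docs with fewer clicks than expected) would need click-rate estimates and is left out.

```python
import random

def sample_triplet(query, clicks, solr_results, all_docs, stage, rng):
    """Return (query, clicked_doc, negative_doc) for one training example.

    stage 0: negative is a random doc that was not clicked for this query.
    stage 1: negative is a doc the engine returned but that got no click.
    """
    clicked = clicks[query]
    positive = rng.choice(sorted(clicked))
    if stage == 0:
        pool = [d for d in all_docs if d not in clicked]
    else:
        pool = [d for d in solr_results[query] if d not in clicked]
    return query, positive, rng.choice(pool)

def triplet_hinge_loss(pos_score, neg_score, margin=1.0):
    """Zero once the clicked doc outscores the negative by `margin`."""
    return max(0.0, margin - pos_score + neg_score)

# Hypothetical toy clickstream.
all_docs = ["d1", "d2", "d3", "d4", "d5"]
clicks = {"lumix zs8": {"d1"}}
solr_results = {"lumix zs8": ["d1", "d2", "d3"]}
rng = random.Random(0)
easy = sample_triplet("lumix zs8", clicks, solr_results, all_docs, 0, rng)
hard = sample_triplet("lumix zs8", clicks, solr_results, all_docs, 1, rng)
```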
Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Distributed vs Local training
• Query-time inference
• Ideally: minimal pre/post-processing
• beware of finicky tensor mappings!
• jvm: MLeap TF support
want: simple model serving config:
MLeap source: TF integration
http://mleap-docs.combust.ml/
(also supports SparkML, sklearn,
xgboost, etc)
(…and now for something completely different)
Agenda
• Personalized Search and the Clickstream
• Deep Learning to Rank
• Deep Tokens for Lucene
• char-CNN internals
• LSH for discretization
• Hierarchical semantic tokenization
Deep Tokens
• What does a 1d-CNN consume/emit?
• Consumes a sequence (length n) of k-dim vectors
• Emits a sequence (length n) of f-dim vectors
• (assuming sequences are pre+post-padded)
• If a CNN layer’s windows are w-wide, require:
• w*k*f parameters (plus biases)
• Activations are often ReLU: >= 0 w/lots of 0’s
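To make the shapes concrete, here is a deliberately naive pure-Python 1d convolution with zero padding and ReLU, plus the parameter count from the bullets above. It is a sketch for illustration (function names are made up, window width is assumed odd), not how a framework implements the layer.

```python
def conv1d_param_count(w, k, f):
    """Weights (w * k * f) plus one bias per output feature map."""
    return w * k * f + f

def conv1d_same_relu(seq, kernels, biases):
    """Naive 1d convolution with zero 'same' padding and ReLU.

    seq: n vectors of dimension k
    kernels: f kernels, each w rows of k weights (w assumed odd)
    Returns n vectors of dimension f.
    """
    n, k = len(seq), len(seq[0])
    w = len(kernels[0])
    half = w // 2
    zero = [0.0] * k
    padded = [zero] * half + list(seq) + [zero] * half
    out = []
    for pos in range(n):
        window = padded[pos:pos + w]
        row = []
        for kernel, bias in zip(kernels, biases):
            total = bias
            for x_vec, w_vec in zip(window, kernel):
                total += sum(x * wt for x, wt in zip(x_vec, w_vec))
            row.append(max(0.0, total))  # ReLU: negatives become 0
        out.append(row)
    return out

# n=4 positions, k=1 input dim, f=2 feature maps, w=3 window.
seq = [[1.0], [2.0], [3.0], [4.0]]
kernels = [[[1.0], [1.0], [1.0]], [[-1.0], [-1.0], [-1.0]]]
activations = conv1d_same_relu(seq, kernels, [0.0, 0.0])
```

The second feature map comes out all zeros, which is the sparsity the ReLU bullet refers to.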
Deep Tokens: intermediate layers
• 1d-CNN feature-vectors
• How to get this data?
• activs = [enc.layers[3].output, enc.layers[5].output]
• extractor = Model(inputs=enc.inputs, outputs=activs)
1d-char CNN feature vectors by layer
• layer 0:
• Learns simple features like word suffixes, simple morphology, spacing, etc
• layer 1:
• slightly more features like word roots, articles, pronouns, etc
• layer 2:
• complex features: words + common misspellings, hyphenations/concatenations
• layer n:
• Every time you pool + stride over previous layer, effective window grows by factor of
pool_size
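A small sketch of how the effective window (in input characters) grows with depth, assuming every layer is a stride-1 convolution followed by pooling whose stride equals `pool_size`. The function name and layer description format are hypothetical.

```python
def effective_windows(layers):
    """layers: list of (conv_window, pool_size) pairs, shallowest first.

    Returns the effective window, in input characters, seen by one
    output position after each conv + pool layer.
    """
    window, step = 1, 1
    result = []
    for conv_window, pool_size in layers:
        window += (conv_window - 1) * step  # convolution widens the view
        window += (pool_size - 1) * step    # pooling merges neighbours
        step *= pool_size                   # later positions are further apart
        result.append(window)
    return result

windows = effective_windows([(3, 2), (3, 2), (3, 2)])
```

With 3-wide windows and pool size 2 the window goes 4, 10, 22: roughly doubling per layer, in line with the "factor of pool_size" rule of thumb.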
How deep can a char-CNN go?!?
• “Very Deep Convolutional Networks for Text Classification”,
Conneau, Schwenk, LeCun, Barrault; ’17
• very small (3char) windows, low filter count (64) early on
• “temporal version” of VGG architecture
• 29 layers, input as long as 1k chars
• Trained on 100k-3M docs
• 2.5 days on single GPU
• (I don’t know if this works for ranking)
What can we do with these vectors?
• Locality Sensitive Hash to int codes
• dense vector becomes 16-24 bit int
• text => List[Int] at each layer
• Layer 0: same length as input
• Layer N+1 after k-pooling: len(layer_n.output)/k
• Indexing List[Int] is easy!
• “makes sense” to an inverted index
• Query time
• Query => List[Int] per layer
• search as usual (with sparsity!)
LSH in 30 seconds:
• Random projections preserve distances on account of:
• Johnson-Lindenstrauss lemma
• Can pick totally random vectors
• Or: random sample of 2K vectors from your dataset, project via p_i = v_i - v_{i+1}
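A minimal sketch of the "totally random vectors" variant: each random hyperplane contributes one sign bit, and the bits are packed into an int code. This is the sign-of-random-projection family of LSH. Whether the talk used exactly this family is an assumption, and the function names are made up.

```python
import random

def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random Gaussian vectors of dimension `dim`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def lsh_code(vector, hyperplanes):
    """Pack the sign of each random projection into one integer code."""
    code = 0
    for plane in hyperplanes:
        dot = sum(x * p for x, p in zip(vector, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

planes = make_hyperplanes(dim=8, bits=16)
activation = [0.5, 0.0, 1.25, 0.0, 0.0, 2.0, 0.75, 0.0]  # ReLU-style vector
token = lsh_code(activation, planes)
```

The hyperplanes must be fixed once and shared between indexing and query time, otherwise the same activation would map to different tokens.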
Deep Tokens: sample similar char-ngrams
• Trained 7-layer char-CNN ranker on 3M BestBuy ecommerce clicks (from Kaggle)
• 64-256 feature maps
• quasi-“hard” negative mining by taking docs returned by Solr but with no clicks
• Example ngrams similar at layer 3-ish or so:
• similar: “ rin”, “e ri”, “rinf”
• From: “lord of the ring”, “LOTR extended edition dvd”, “lord of the rinfs extended”
• and:
• “0 in”, “0in “, “ nch”, “inch”
• From: “70 inch lcd”, “55 nch tv”, “90in sony tv”
• and:
• “s z 8”, “ zs8 “, “ sz8 ”, “lumix”
• From: “panasonic lumix s z 8”, “lumix zs8”, “panasonic dmc-zs8s”
• longer strings are similar 2 layers deeper:
• “10.1inches”, “lnch”, “inchplasma”, “inch”
• Still to do: full measurement of full DL ranking vs. approximate multilayer search on these
tokens, while sweeping the hyperparameter space and hashing strategies
Deep tokens: challenges
• Stability:
• Once model + LSH family is chosen, this is like “choosing an Analyzer” - changing requires
full reindex
• Hash functions which are “optimal” for one data set may be bad after indexing much more
data
• Similarity on differing scales with same semantics
• i.e. “55in” and “fifty five inch”
• (“shortcut” CNN connections needed?)
• Stop words
• want: no hash bucket (i.e. posting list) at any level has > 10% of corpus
• Noisy tokens at earlier levels (maybe never “index” first 3?)
• More generally
• precision vs. recall tradeoff tuning
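The stop-word criterion above can be checked offline before indexing. A sketch, assuming each document at a given layer has already been turned into a list of int tokens; the function name and threshold parameter are hypothetical.

```python
from collections import Counter

def overfull_buckets(docs_tokens, max_fraction=0.10):
    """Return tokens whose posting list covers more than max_fraction of docs."""
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each token once per document
    limit = max_fraction * len(docs_tokens)
    return sorted(t for t, df in doc_freq.items() if df > limit)

# Toy corpus of 5 documents; token 1 appears in 4 of them.
docs = [[1, 2], [1, 3], [1, 4], [5], [6, 1]]
too_common = overfull_buckets(docs, max_fraction=0.5)
```

Tokens flagged this way could be dropped at index time, much like a conventional stop-word list.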
Related work: Xu, et al, CNNs for Text Hashing (IJCAI ’15)
and many more (but none with as fun an acronym)
Deep Tokens: TL;DR
• configure model w/ deep char-CNN-based ranker w/search relevance loss
• Train it as usual
• Configure a convolutional feature extractor (CFE)
• From documents:
• Extract convolutional activations
• (learned textual features!)
• LSH -> discrete buckets (“abstract tokens”)
• Index these tokens
• At query time, use this CFE for:
• posting-list friendly deeply fuzzy search!
• (because really, just have a very fancy tokenizer)
• N.B. char-CNN models are small (O(100-300k) params)
Thank you!
Jake Mannix
Chief Data Engineer, Lucidworks
@pbrane
#Activate18 #ActivateSearch

More Related Content

What's hot

(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine Learning(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine LearningAmazon Web Services
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLPBill Liu
 
Introduction of Machine learning and Deep Learning
Introduction of Machine learning and Deep LearningIntroduction of Machine learning and Deep Learning
Introduction of Machine learning and Deep LearningMadhu Sanjeevi (Mady)
 
Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...
Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...
Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...Edureka!
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
 
Introduction to Keras
Introduction to KerasIntroduction to Keras
Introduction to KerasJohn Ramey
 
Building NLP solutions using Python
Building NLP solutions using PythonBuilding NLP solutions using Python
Building NLP solutions using Pythonbotsplash.com
 
Deep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecDeep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecJosh Patterson
 
GDG-Shanghai 2017 TensorFlow Summit Recap
GDG-Shanghai 2017 TensorFlow Summit RecapGDG-Shanghai 2017 TensorFlow Summit Recap
GDG-Shanghai 2017 TensorFlow Summit RecapJiang Jun
 
Java Serialization Facts and Fallacies
Java Serialization Facts and FallaciesJava Serialization Facts and Fallacies
Java Serialization Facts and FallaciesRoman Elizarov
 
Anghami: From Billions Of Streams To Better Recommendations
Anghami: From Billions Of Streams To Better RecommendationsAnghami: From Billions Of Streams To Better Recommendations
Anghami: From Billions Of Streams To Better RecommendationsRamzi Karam
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JJosh Patterson
 
Statistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learnStatistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learnOlivier Grisel
 
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...Lucidworks
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Sergey Karayev
 

What's hot (20)

(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine Learning(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine Learning
 
NLP from scratch
NLP from scratch NLP from scratch
NLP from scratch
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 
Introduction of Machine learning and Deep Learning
Introduction of Machine learning and Deep LearningIntroduction of Machine learning and Deep Learning
Introduction of Machine learning and Deep Learning
 
Tensorflow vs MxNet
Tensorflow vs MxNetTensorflow vs MxNet
Tensorflow vs MxNet
 
Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...
Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...
Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Introduction to Keras
Introduction to KerasIntroduction to Keras
Introduction to Keras
 
Building NLP solutions using Python
Building NLP solutions using PythonBuilding NLP solutions using Python
Building NLP solutions using Python
 
CBOR - The Better JSON
CBOR - The Better JSONCBOR - The Better JSON
CBOR - The Better JSON
 
TensorFlow 101
TensorFlow 101TensorFlow 101
TensorFlow 101
 
Deep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecDeep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVec
 
GDG-Shanghai 2017 TensorFlow Summit Recap
GDG-Shanghai 2017 TensorFlow Summit RecapGDG-Shanghai 2017 TensorFlow Summit Recap
GDG-Shanghai 2017 TensorFlow Summit Recap
 
Java Serialization Facts and Fallacies
Java Serialization Facts and FallaciesJava Serialization Facts and Fallacies
Java Serialization Facts and Fallacies
 
Anghami: From Billions Of Streams To Better Recommendations
Anghami: From Billions Of Streams To Better RecommendationsAnghami: From Billions Of Streams To Better Recommendations
Anghami: From Billions Of Streams To Better Recommendations
 
The State of #NLProc
The State of #NLProcThe State of #NLProc
The State of #NLProc
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4J
 
Statistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learnStatistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learn
 
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
 

Similar to Deep Learning for Unified Personalized Search Recommendations (and Fuzzy Tokenization

Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...smallerror
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldMilo Yip
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 
You didnt see it’s coming? "Dawn of hardened Windows Kernel"
You didnt see it’s coming? "Dawn of hardened Windows Kernel" You didnt see it’s coming? "Dawn of hardened Windows Kernel"
You didnt see it’s coming? "Dawn of hardened Windows Kernel" Peter Hlavaty
 
FP Days: Down the Clojure Rabbit Hole
FP Days: Down the Clojure Rabbit HoleFP Days: Down the Clojure Rabbit Hole
FP Days: Down the Clojure Rabbit HoleChristophe Grand
 
Introduction to computer vision with Convoluted Neural Networks
Introduction to computer vision with Convoluted Neural NetworksIntroduction to computer vision with Convoluted Neural Networks
Introduction to computer vision with Convoluted Neural NetworksMarcinJedyk
 
Introduction to computer vision
Introduction to computer visionIntroduction to computer vision
Introduction to computer visionMarcin Jedyk
 
Algorithm and Data Structures - Basic of IT Problem Solving
Algorithm and Data Structures - Basic of IT Problem SolvingAlgorithm and Data Structures - Basic of IT Problem Solving
Algorithm and Data Structures - Basic of IT Problem Solvingcoolpie
 
Intelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDBIntelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDBMihnea Giurgea
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiDatabricks
 
prace_days_ml_2019.pptx
prace_days_ml_2019.pptxprace_days_ml_2019.pptx
prace_days_ml_2019.pptxssuserf583ac
 
prace_days_ml_2019.pptx
prace_days_ml_2019.pptxprace_days_ml_2019.pptx
prace_days_ml_2019.pptxRohanBorgalli
 
prace_days_ml_2019.pptx
prace_days_ml_2019.pptxprace_days_ml_2019.pptx
prace_days_ml_2019.pptxSreeVani74
 
10 minutes fun with Cloud API comparison
10 minutes fun with Cloud API comparison10 minutes fun with Cloud API comparison
10 minutes fun with Cloud API comparisonLaurent Cerveau
 
Networking Architecture of Warframe
Networking Architecture of WarframeNetworking Architecture of Warframe
Networking Architecture of WarframeMaciej Siniło
 

Similar to Deep Learning for Unified Personalized Search Recommendations (and Fuzzy Tokenization (20)

Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
You didnt see it’s coming? "Dawn of hardened Windows Kernel"
You didnt see it’s coming? "Dawn of hardened Windows Kernel" You didnt see it’s coming? "Dawn of hardened Windows Kernel"
You didnt see it’s coming? "Dawn of hardened Windows Kernel"
 
FP Days: Down the Clojure Rabbit Hole
FP Days: Down the Clojure Rabbit HoleFP Days: Down the Clojure Rabbit Hole
FP Days: Down the Clojure Rabbit Hole
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
Introduction to computer vision with Convoluted Neural Networks
Introduction to computer vision with Convoluted Neural NetworksIntroduction to computer vision with Convoluted Neural Networks
Introduction to computer vision with Convoluted Neural Networks
 
Introduction to computer vision
Introduction to computer visionIntroduction to computer vision
Introduction to computer vision
 
Algorithm and Data Structures - Basic of IT Problem Solving
Algorithm and Data Structures - Basic of IT Problem SolvingAlgorithm and Data Structures - Basic of IT Problem Solving
Algorithm and Data Structures - Basic of IT Problem Solving
 
Intelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDBIntelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDB
 
DIY Java Profiling
DIY Java ProfilingDIY Java Profiling
DIY Java Profiling
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
 
prace_days_ml_2019.pptx
prace_days_ml_2019.pptxprace_days_ml_2019.pptx
prace_days_ml_2019.pptx
 
prace_days_ml_2019.pptx
prace_days_ml_2019.pptxprace_days_ml_2019.pptx
prace_days_ml_2019.pptx
 
prace_days_ml_2019.pptx
prace_days_ml_2019.pptxprace_days_ml_2019.pptx
prace_days_ml_2019.pptx
 
10 minutes fun with Cloud API comparison
10 minutes fun with Cloud API comparison10 minutes fun with Cloud API comparison
10 minutes fun with Cloud API comparison
 
Networking Architecture of Warframe
Networking Architecture of WarframeNetworking Architecture of Warframe
Networking Architecture of Warframe
 

More from Lucidworks

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategyLucidworks
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceLucidworks
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsLucidworks
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesLucidworks
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Lucidworks
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...Lucidworks
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Lucidworks
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Lucidworks
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentLucidworks
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeLucidworks
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Lucidworks
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchLucidworks
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Lucidworks
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyLucidworks
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Lucidworks
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceLucidworks
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchLucidworks
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondLucidworks
 


Deep Learning for Unified Personalized Search Recommendations (and Fuzzy Tokenization)

  • 1. Deep Learning for Unified Personalized Search Recommendations (and Fuzzy Tokenization) Jake Mannix Chief Data Engineer, Lucidworks @pbrane | in/jakemannix #Activate18 #ActivateSearch
  • 2. $whoami • Now: Chief Data Engineer, Lucidworks • Applied ML / relevance / RecSys • data engineering • Previously: • Allen Institute for AI: research pub. semantic search • Twitter: account search, user interest modeling, RecSys • LinkedIn: profile search, generic entity-to-entity RecSys • Prehistory: • Other software dev. • Algebraic topology, particle cosmology
  • 3. Agenda • Personalized Search and the Clickstream • Deep Learning To Rank • Deep Tokenization for Lucene
  • 4. Search Relevance Feature Types • static document priors • query intent class labels • query entities • query / doc text similarity • personalization (p18n) • clickstream • (example Solr query which demonstrates all of these omitted because it doesn’t fit on this slide)
  • 5. Agenda: getting down to business • Personalized Search and the Clickstream • Deep Learning To Rank • Embeddings • Text encoding • p18n • clickstream • Objective functions • Distributed vs Local training • Query time inference • Deep Tokenization for Lucene
  • 6. DL4IR: How I learned to stop worrying and love deep neural networks • Non-reasons: • Always the best ranking results • c++/CUDA under the hood => superfast inference • “default” model works OOTB • My reasons, as a data engineer: • Extremely modular, unified framework • Easily updatable models • GPU => fewer distributed systems • Domain Knowledge + Feature Engineering => Naive Vectorization + Network Architecture Engineering
  • 7. DL4IR: Why? • Extremely modular, unified framework. DL models are: • dissectible: reusable sub-modules • composable: inputs to other models • Easily updatable models • ok, maybe not “easy” • (because transfer learning is hard) • GPU => fewer distributed systems • GPU=supercomputer, CUDA already written • Feature Engineering is not repeatable: • Architecture Engineering is (more or less) • in DL, features aren’t free, but are learned
  • 8. Agenda: Deep LTR • Deep Learning to Rank • Embeddings: • pre-trained • from scratch • fine tuned • Text encoding • P18n: userId embeddings • clickstream: docId embeddings • Objective functions • Distributed vs Local training • Query-time inference
  • 9. Embeddings • Pre-trained text embeddings: • GloVe (https://nlp.stanford.edu/projects/glove/) • NNLM on Google news (https://tfhub.dev/google/nnlm-en-dim128/1) • fastText (https://fasttext.cc) • ELMo (https://tfhub.dev/google/elmo/2) • From scratch • Many parameters -> lots of training data • Can be unsupervised first, then treated as above • Fine-tuned • Start w/ pre-trained, w/ trainable=False • Train as usual, but not to convergence • Re-start training with trainable=True + lower learning rate
  • 10. Embeddings: keras code Pre-trained embeddings as numpy array of dense vectors (indexed by token-id), just start building your model like so: After training, the embedding will be saved with your model, and you can also extract it out:
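The code on this slide did not survive in the transcript. The following is a minimal sketch of what such a setup can look like, not the speaker's original code. It assumes `tensorflow.keras`; the vocabulary size, the random stand-in for the pre-trained matrix, the layer names and the tiny pooling/Dense head are all hypothetical.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in for a real pre-trained matrix (e.g. GloVe rows indexed by token-id).
# Sizes are hypothetical.
vocab_size, dim = 1000, 50
pretrained = np.random.RandomState(0).normal(size=(vocab_size, dim)).astype("float32")

token_ids = keras.Input(shape=(None,), dtype="int32", name="token_ids")
embedding = layers.Embedding(
    input_dim=vocab_size,
    output_dim=dim,
    # Initialize the layer's weights from the pre-trained array.
    embeddings_initializer=keras.initializers.Constant(pretrained),
    trainable=False,  # frozen for the first phase of training
    name="token_emb",
)
x = embedding(token_ids)
x = layers.GlobalAveragePooling1D()(x)            # toy text encoder
score = layers.Dense(1, activation="sigmoid")(x)  # toy relevance head
model = keras.Model(token_ids, score)
model.compile(optimizer="adam", loss="binary_crossentropy")

# After training, the embedding is saved with the model and can be pulled out
# again as a numpy array of shape (vocab_size, dim).
learned = model.get_layer("token_emb").get_weights()[0]
```

For the fine-tuning recipe from slide 9, one would later set `embedding.trainable = True` and call `model.compile` again with a lower learning rate before resuming training.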
  • 11. Agenda • Deep Learning to Rank • Embeddings • Text encoding: • chars vs words • CNNs vs LSTMs • P18n: userId embeddings • clickstream: docId embeddings • Objective functions • Distributed vs Local training • Query-time inference
  • 12. Text encoding • Characters vs Words: • word embeddings require lots of data • Millions of parameters => many GB of training data • needs good tokenization + preprocessing • (same in data sci pipeline / at query time!) • Try char sequences instead! • sometimes works for “old” ML • works on small data • on raw byte streams (no tokenizers) • not my clever trick (cf. Zhang, Zhao, LeCun ‘15)
  • 13. 1d-CNNs vs LSTMs: both operate on sequences CNN: Convolutional Neural Network: 2d for images, 1d for text LSTM: Long Short-Term Memory: updates state as it reads, can emit sequence of states at each position as input for another LSTM:
  • 14. LSTMs are “better”, but I ♥ CNNs • LSTMs for text: • A little harder to understand (boo!) • (black box)-ish, not much to dissect (yay/boo?) • Many parameters, needs big data (boo!) • Not GPU-friendly -> slow to train (boo!) • Often works OOTB w/ no tuning (yay!) • Typically SOTA quality after significant tuning (yay!) • CNNs for text: • Fairly simple to understand (yay!) • Easily dissectible (yay!) • Few parameters, requires less training data (yay!) • GPU-friendly -> super fast to train (yay!) • Many many hyperparameters -> hard to tune (boo!) • Currently not SOTA (boo!) but aren’t far off (yay!) • Typically requires more code (boo!)
  • 15. 1D CNN text encoder: keras code
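The encoder code itself is missing from the transcript. As a rough sketch only (the layer count, filter sizes, names and the byte-level input are assumptions, not the speaker's architecture), a small 1D char-CNN text encoder in `tensorflow.keras` could look like this:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN = 64   # hypothetical max number of bytes per query/title
VOCAB = 256    # raw bytes, so no tokenizer is needed


def build_char_cnn_encoder(max_len=MAX_LEN, vocab=VOCAB, emb_dim=16,
                           filters=64, window=3, pool=2, out_dim=32):
    """Bytes in, one fixed-size text vector out."""
    chars = keras.Input(shape=(max_len,), dtype="int32", name="char_ids")
    x = layers.Embedding(vocab, emb_dim, name="char_emb")(chars)
    # padding="same" keeps the sequence length unchanged by the convolution.
    x = layers.Conv1D(filters, window, padding="same",
                      activation="relu", name="conv0")(x)
    x = layers.MaxPooling1D(pool, name="pool0")(x)  # length / pool
    x = layers.Conv1D(filters, window, padding="same",
                      activation="relu", name="conv1")(x)
    x = layers.GlobalMaxPooling1D(name="global_pool")(x)
    out = layers.Dense(out_dim, name="text_vec")(x)
    return keras.Model(chars, out, name="char_cnn_encoder")


def to_bytes(text, max_len=MAX_LEN):
    """Encode text as UTF-8 byte ids, truncated and zero-padded to max_len."""
    raw = list(text.encode("utf-8"))[:max_len]
    return np.array(raw + [0] * (max_len - len(raw)), dtype="int32")


encoder = build_char_cnn_encoder()
batch = np.stack([to_bytes("70 inch lcd"), to_bytes("lumix zs8")])
vectors = encoder.predict(batch, verbose=0)  # shape: (2, out_dim)
```

Calling `encoder.summary()` prints the per-layer shapes and parameter counts, which is the kind of information the next slide refers to.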
  • 16. 1D CNN text encoder: layer shapes and sizes
  • 17. p18n features • Deep Learning to Rank • Embeddings • Text encoding • p18n: userId embeddings • pre-trained RecSys (ALS) model • from scratch w/ hashing trick • clickstream: docId embeddings • objective functions • Distributed vs Local training • Query-time inference
  • 18. p18n: pre-trained “embeddings” vs hashing trick ALS matrix decomposition as “pre-trained embedding” from collaborative filtering: or: just hash UIDs to O(1k) dim (4x: avoid total collisions) and learn an O(1k) x O(100) embedding for them
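The hashing-trick variant can be sketched in plain Python. This is one reading of the slide, not a confirmed implementation: "4x" is taken to mean four independent hashes per user ID, so that two users only share a vector if they collide on all four. The helper names, the MD5-based hash and the random stand-in for the learned table are assumptions.

```python
import hashlib
import random

NUM_BUCKETS = 1024  # O(1k) hash buckets
EMB_DIM = 100       # O(100) embedding dimensions
NUM_HASHES = 4      # assumed meaning of "4x"


def uid_buckets(uid, num_buckets=NUM_BUCKETS, num_hashes=NUM_HASHES):
    """Map a user ID to num_hashes bucket ids, one per seeded hash."""
    buckets = []
    for seed in range(num_hashes):
        digest = hashlib.md5(f"{seed}:{uid}".encode("utf-8")).digest()
        buckets.append(int.from_bytes(digest[:8], "big") % num_buckets)
    return buckets


# Stand-in for the learned O(1k) x O(100) embedding table; in a real model
# these rows would be trained jointly with the ranker.
_rng = random.Random(0)
table = [[_rng.gauss(0.0, 0.1) for _ in range(EMB_DIM)] for _ in range(NUM_BUCKETS)]


def user_vector(uid):
    """Average the embedding rows of the buckets the user ID hashes to."""
    rows = [table[b] for b in uid_buckets(uid)]
    return [sum(col) / len(rows) for col in zip(*rows)]
```

The table size stays fixed no matter how many users exist, which is the point of the trick: new user IDs need no vocabulary update, only more training data.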
  • 19. Clickstream features • Deep Learning to Rank • Embeddings • Text encoding • p18n: userId embeddings • clickstream: docId embeddings • same as for userId! • can overfit easily • “memorizing” query/doc history • (which is sometimes ok…) • Objective functions • Distributed vs Local training • Query-time inference
  • 20. All together now: p18n query/doc CNN ranker
  • 21. Picture > 1k words
  • 22. Agenda • Deep Learning to Rank • Embeddings • Text encoding • p18n: userId embeddings • clickstream: docId embeddings • Objective functions: • Sentiment • Text classification • Text generation • Identity function • Ranking • Distributed vs Local training • Query-time inference
  • 23. non-classification objectives • Text generation: Neural Network Language Models (NNLM) • Predict the next character/word from text • Identity function: Autoencoder • Predict the input as output • Search Ranking: score(query, doc) • query -click-> doc => score = 1 • query -no-click-> doc => score = 0 • better w/ triplets + “curriculum learning”: • Start with random “no-click” pairs • Later, pick docs Solr returns for query • (but got no clicks!) • eventually: docs w/ fewer clicks than expected • (known as “hard negative mining”)
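The first two curriculum stages can be sketched as a triplet sampler. The data structures, the function names and the stage labels are hypothetical, and the hinge-style triplet loss is a common choice assumed here; the slide does not say which loss was used.

```python
import random


def sample_triplets(clicks, solr_results, all_docs, stage, rng):
    """Build (query, clicked_doc, negative_doc) triplets.

    clicks:       dict query -> set of clicked doc ids
    solr_results: dict query -> list of doc ids the engine returned
    stage:        "random"    -> negatives drawn from the whole corpus
                  "retrieved" -> negatives are returned-but-unclicked docs
    """
    triplets = []
    for query, clicked in clicks.items():
        if stage == "random":
            pool = [d for d in all_docs if d not in clicked]
        else:
            pool = [d for d in solr_results.get(query, []) if d not in clicked]
        if not pool:
            continue  # no usable negative for this query at this stage
        for pos in sorted(clicked):
            triplets.append((query, pos, rng.choice(pool)))
    return triplets


def hinge_triplet_loss(score_pos, score_neg, margin=1.0):
    """Zero once the clicked doc outscores the negative by at least margin."""
    return max(0.0, margin - (score_pos - score_neg))


clicks = {"70 inch lcd": {"tv1"}}
solr_results = {"70 inch lcd": ["tv1", "tv2", "cable9"]}
all_docs = ["tv1", "tv2", "cable9", "dvd3", "cam4"]
easy = sample_triplets(clicks, solr_results, all_docs, "random", random.Random(0))
hard = sample_triplets(clicks, solr_results, all_docs, "retrieved", random.Random(0))
```

The third stage on the slide (docs with fewer clicks than expected) would need per-position click expectations and is not covered by this sketch.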
  • 24. Agenda • Deep Learning to Rank • Embeddings • Text encoding • p18n • clickstream • Distributed vs Local training • Query-time inference
  • 25. Agenda • Deep Learning to Rank • Embeddings • Text encoding • p18n • clickstream • Distributed vs Local training • Query-time inference • Ideally: minimal pre/post-processing • beware of finicky tensor mappings! • jvm: MLeap TF support
  • 26. want: simple model serving config:
  • 27. MLeap source: TF integration http://mleap-docs.combust.ml/ (also supports SparkML, sklearn, xgboost, etc)
  • 28. (…and now for something completely different)
  • 29. Agenda • Personalized Search and the Clickstream • Deep Learning to Rank • Deep Tokens for Lucene • char-CNN internals • LSH for discretization • Hierarchical semantic tokenization
  • 30. Deep Tokens • What does a 1d-CNN consume/emit? • Consumes a sequence (length n) of k-dim vectors • Emits a sequence (length n) of f-dim vectors • (assuming sequences are pre+post-padded) • If a CNN layer’s windows are w-wide, require: • w*k*f parameters (plus biases) • Activations are often ReLU: >= 0 w/lots of 0’s
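To make the shapes and the parameter count concrete, here is a small dependency-free sketch of one padded 1D convolution layer with ReLU. It is illustrative only (real models would use a framework layer), and the function names are made up for this example.

```python
def conv1d_param_count(w, k, f):
    """w*k*f kernel weights plus one bias per filter."""
    return w * k * f + f


def conv1d_same_relu(seq, kernels, biases):
    """seq: n vectors of dim k. kernels: f filters, each w rows of k weights.

    Returns n vectors of dim f; zero padding before and after the sequence
    keeps the output length equal to the input length.
    """
    n, k = len(seq), len(seq[0])
    w = len(kernels[0])
    left = (w - 1) // 2
    zero = [0.0] * k
    padded = [zero] * left + list(seq) + [zero] * (w - 1 - left)
    out = []
    for pos in range(n):
        window = padded[pos:pos + w]
        row = []
        for kernel, bias in zip(kernels, biases):
            total = bias + sum(kernel[i][j] * window[i][j]
                               for i in range(w) for j in range(k))
            row.append(max(0.0, total))  # ReLU: never negative, often exactly 0
        out.append(row)
    return out


# n=5 positions, k=2 input dims, f=3 filters, w=3 wide windows
seq = [[1.0, -1.0], [0.5, 0.0], [0.0, 2.0], [-1.0, 1.0], [0.0, 0.0]]
kernels = [[[0.1, 0.2]] * 3, [[-0.3, 0.1]] * 3, [[0.0, -0.5]] * 3]
biases = [0.0, 0.1, -0.1]
activations = conv1d_same_relu(seq, kernels, biases)
```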
  • 31. Deep Tokens: intermediate layers • 1d-CNN feature-vectors • Consumes a sequence (length n) of k-dim vectors • Emits a sequence (length n) of f-dim vectors • (assuming sequences are pre+post-padded) • If a CNN layer’s windows are w-wide, require: • w*k*f parameters (plus biases) • Activations are often ReLU: >= 0 w/lots of 0’s • How to get this data? • activs = [enc.layers[3].output, enc.layers[5].output] • extractor = Model(inputs=enc.inputs, outputs=activs)
  • 32. 1d-char CNN feature vectors by layer • layer 0: • Learns simple features like word suffixes, simple morphology, spacing, etc • layer 1: • slightly more features like word roots, articles, pronouns, etc • layer 2: • complex features: words + common misspellings, hyphenations/concatenations • layer n: • Every time you pool + stride over previous layer, effective window grows by factor of pool_size
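How fast the effective window grows can be checked with the standard receptive-field recurrence. The stack below (3-wide convolutions alternating with 2-wide pooling) is a hypothetical example, not the architecture from the talk.

```python
def receptive_fields(layers):
    """layers: list of (window, stride) pairs, in order from the input up.

    Returns, for each layer, how many input characters one of its output
    positions can see. Each layer adds (window - 1) * jump characters,
    where jump is the product of all strides below it.
    """
    rf, jump, fields = 1, 1, []
    for window, stride in layers:
        rf += (window - 1) * jump
        jump *= stride
        fields.append(rf)
    return fields


# conv(3) -> pool(2, stride 2) -> conv(3) -> pool(2, stride 2) -> conv(3)
stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
fields = receptive_fields(stack)  # → [3, 4, 8, 10, 18]
```

After each pooling step, the same 3-wide convolution adds twice as many input characters as before, which is the "grows by factor of pool_size" effect the slide describes.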
  • 33. How deep can a char-CNN go?!? • “Very Deep Convolutional Networks for Text Classification”, Conneau, Schwenk, LeCun, Barrault; ’17 • very small (3char) windows, low filter count (64) early on • “temporal version” of VGG architecture • 29 layers, input as long as 1k chars • Trained on 100k-3M docs • 2.5 days on single GPU • (I don’t know if this works for ranking)
  • 34. What can we do with these vectors? • Locality Sensitive Hash to int codes • dense vector becomes 16-24 bit int • text => List[Int] at each layer • Layer 0: same length as input • Layer N+1 after k-pooling: len(layer_n.output)/k • Indexing List[Int] is easy! • “makes sense” to an inverted index • Query time • Query => List[Int] per layer • search as usual (with sparsity!)
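Once each layer's activations are reduced to int codes, indexing them works like indexing any other token stream. The toy inverted index below is a sketch with made-up codes and a made-up `L<layer>_<code>` token format; in practice the tokens would go into Lucene/Solr fields.

```python
from collections import defaultdict


def layer_tokens(codes_per_layer):
    """Turn per-layer int codes into strings such as 'L0_5' or 'L1_12'."""
    return [f"L{layer}_{code}"
            for layer, codes in enumerate(codes_per_layer)
            for code in codes]


def build_index(docs):
    """docs: dict doc_id -> list (one entry per layer) of int-code lists."""
    index = defaultdict(set)
    for doc_id, codes_per_layer in docs.items():
        for token in layer_tokens(codes_per_layer):
            index[token].add(doc_id)
    return index


def search(index, query_codes):
    """Rank docs by the number of distinct query tokens they contain."""
    hits = defaultdict(int)
    for token in set(layer_tokens(query_codes)):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


# Layer 0 has one code per character; layer 1 is shorter after pooling.
docs = {
    "d1": [[5, 7, 7, 9], [12, 3]],
    "d2": [[5, 1, 1, 2], [40, 41]],
}
index = build_index(docs)
results = search(index, [[7, 9], [12]])  # → [("d1", 3)]
```

A real engine would score with term statistics rather than raw overlap counts; the count here only shows that matching happens through ordinary posting lists.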
  • 35. LSH in 30 seconds: • Random projections preserve distances on account of: • Johnson-Lindenstrauss lemma • Can pick totally random vectors • Or: random sample of 2K vectors from your dataset, project via p_i = v_i - v_{i+1}
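Both ways of picking projection vectors can be sketched with sign-of-dot-product hashing (one bit per projection, packed into an int). The pairing of sampled vectors as (0,1), (2,3), ... is an assumption about what the slide's formula means, and all function names are illustrative.

```python
import random


def random_hyperplanes(dim, bits, rng):
    """Totally random projection vectors with Gaussian entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def hyperplanes_from_data(vectors, bits, rng):
    """Sample 2*bits data vectors and use differences of consecutive pairs."""
    sample = rng.sample(vectors, 2 * bits)
    return [[a - b for a, b in zip(sample[2 * i], sample[2 * i + 1])]
            for i in range(bits)]


def lsh_code(vector, planes):
    """One bit per projection: 1 if the dot product is non-negative."""
    code = 0
    for plane in planes:
        dot = sum(p * x for p, x in zip(plane, vector))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


rng = random.Random(0)
dim, bits = 8, 16
data = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(100)]
planes = random_hyperplanes(dim, bits, rng)
data_planes = hyperplanes_from_data(data, bits, rng)
code = lsh_code(data[0], planes)  # a 16-bit int
```

Vectors pointing in nearby directions agree on most bits, so they tend to land in the same or nearby buckets; more bits mean finer buckets and fewer accidental matches.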
  • 36. Deep Tokens: sample similar char-ngrams • Trained 7-layer char-CNN ranker on 3M BestBuy ecommerce clicks (from Kaggle) • 64-256 feature maps • quasi-“hard” negative mining by taking docs returned by Solr but with no clicks • Example ngrams similar at layer 3-ish or so: • similar: “ rin”, “e ri”, “rinf” • From: “lord of the ring”, “LOTR extended edition dvd”, “lord of the rinfs extended” • and: • “0 in”, “0in “, “ nch”, “inch” • From: “70 inch lcd”, “55 nch tv”, “90in sony tv” • and: • “s z 8”, “ zs8 “, “ sz8 ”, “lumix” • From: “panasonic lumix s z 8”, “lumix zs8”, “panasonic dmc-zs8s” • longer strings similar at layer 2 levels deeper: • “10.1inches”, “lnch”, “inchplasma”, “inch” • Still to do: full measurement of full DL ranking vs. approximate multilayer search on these tokens, while sweeping the hyperparameter space and hashing strategies
  • 37. Deep tokens: challenges • Stability: • Once model + LSH family is chosen, this is like “choosing an Analyzer” - changing requires full reindex • Hash functions which are “optimal” for one data set may be bad after indexing much more data • Similarity on differing scales with same semantics • i.e. “55in” and “fifty five inch” • (“shortcut” CNN connections needed?) • Stop words • want: no hash bucket (i.e. posting list) at any level have > 10% of corpus • Noisy tokens at earlier levels (maybe never “index” first 3?) • More generally • precision vs. recall tradeoff tuning
  • 38. Related work: Xu, et al, CNNs for Text Hashing (IJCAI ’15) and many more (but none with as fun an acronym)
  • 39. Deep Tokens: TL;DR • configure model w/ deep char-CNN-based ranker w/search relevance loss • Train it as usual • Configure a convolutional feature extractor (CFE) • From documents: • Extract convolutional activations • (learned textual features!) • LSH -> discrete buckets (“abstract tokens”) • Index these tokens • At query time, use this CFE for: • posting-list friendly deeply fuzzy search! • (because really, just have a very fancy tokenizer) • N.B. char-cnn models are small (O(100-300k) params)
  • 40. Thank you! Jake Mannix Chief Data Engineer, Lucidworks @pbrane #Activate18 #ActivateSearch