Making Reddit Search
Relevant and Scalable
Anupama Joshi
Senior Engineering Manager, Search
Jerry Bao
Senior Software Engineer, Search
Agenda
• What is Reddit?
• Search Architecture
• Improving our Relevance
• The History of Search @ Reddit
• Scaling our Infrastructure
• Q&A
What is Reddit?
Reddit is a network of communities where
individuals can find experiences built
around their interests, hobbies and
passions
It’s where people converse about the
things that are most important to them
Bring community and
belonging to everyone
Our mission
Reddit by the numbers
Alexa Rank (US/World): 5th/18th
MAU: 400M+
Communities: 1M+
Posts per day: 440K+
Comments per day: 3.5M+
Votes per day: 82M+
Searches per day: 68M+
So, what are
we doing with
all that
power?
Dog getting love
51.2k points (95% upvoted)
Cat Fist Bumping
137.1k points (90% upvoted)
817.2k views
Wait, it’s not just
cat/dog pictures!
Community > Content > Individual
● Authenticity
● Creative freedom
● Empathy @ scale
● Belonging
● Being heard
r/assistance
Empathy and support at scale
News Source
Reddit’s Community of Support
None of that matters if you can’t
FIND the content! So let’s talk about
Search...
User Retention
● New users who
searched are 300%
more likely to come
back between D1 and
D14
● > 50% of all mobile
users search
Search @ Reddit
Search Today: Architecture
Show and Tell: A better subreddit search
Challenge: Redditors are very creative in their subreddit naming (e.g. r/superbowl
is about superb owl pictures), which, while fun, poses a challenge for discovery.
Answer: faceted search on posts!
Result: A better subreddit search
Show and Tell: Better Post Search
● Post search with phrase matching of selftext
The challenge: What about images and link posts?
Answer - Comments
● Comments are important but which comments are most relevant to the post?
● How do we separate the signal from the noise?
Answer - HVT
● HVTs are the highest scoring tf-idf terms from comment sections.
● Index and match on these HVTs along with post selftexts and titles.
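As a rough illustration of the HVT idea, the sketch below treats each post's comment section as one document and keeps its highest-scoring tf-idf terms. The function names, the tokenizer, and the toy comments are assumptions made for this example; the slides do not describe how the production job tokenizes or scores.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter, digit or apostrophe."""
    return re.findall(r"[a-z0-9']+", text.lower())


def high_value_terms(comments_by_post, top_k=3):
    """Return the top_k tf-idf terms for each post's comment section."""
    # One bag of words per post: all of its comments merged together.
    term_counts = {
        post_id: Counter(tok for comment in comments for tok in tokenize(comment))
        for post_id, comments in comments_by_post.items()
    }
    n_docs = len(term_counts)
    # Document frequency: in how many comment sections each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    result = {}
    for post_id, counts in term_counts.items():
        total = sum(counts.values()) or 1
        scores = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[post_id] = ranked[:top_k]
    return result


# Toy data, invented for this example.
toy = {
    "p1": ["shabooya roll call", "shabooya shabooya", "the best"],
    "p2": ["the cat is the best", "the cat"],
}
hvts = high_value_terms(toy, top_k=1)
```

The selected terms would then be indexed as an extra field next to the title and selftext, so that a query can match a post even when the term only appears in its comments.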
Result: Better Post Search
Qualitatively, we saw some users notice almost immediately when we first introduced HVTs.
For some queries, the difference is
quite stark. The following are
search results for the query
‘shabooya’. Note how ‘shabooya’
doesn’t appear anywhere in the title
or the body of the first three post
results, but you can see the phrase
show up in the comments.
Result: Better Post Search
● Post click-through rate (CTR) (+3.15%)
● Relevancy ranking for navigational searches (MRR) (+4.01%)
● Search experience improvements for navigational searches due to increased
recall on posts with poor title or body text
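For readers unfamiliar with the two metrics, here is a minimal sketch using their standard definitions. The slides do not say exactly how these were computed at Reddit, so the input shape (one entry per search, holding the 1-based rank of the first clicked result or `None` when nothing was clicked) is an assumption.

```python
def click_through_rate(clicked_ranks):
    """Fraction of searches that led to at least one click."""
    if not clicked_ranks:
        return 0.0
    return sum(1 for rank in clicked_ranks if rank) / len(clicked_ranks)


def mean_reciprocal_rank(clicked_ranks):
    """Average of 1/rank of the first clicked result; searches with no click count as 0."""
    if not clicked_ranks:
        return 0.0
    return sum(1.0 / rank for rank in clicked_ranks if rank) / len(clicked_ranks)


# Four searches: clicks at rank 1, rank 2, no click, rank 4.
ranks = [1, 2, None, 4]
ctr = click_through_rate(ranks)
mrr = mean_reciprocal_rank(ranks)
```

MRR rewards putting the clicked result near the top, which is why it suits navigational searches where there is usually one right answer.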
Take It to the Next Level: Improve Search Relevance
● Learn from users' click statistics to automatically generate a relevancy
model
● Rerank search results using aggregated click-signal weights, so that results
users click more often for a given query rank higher
○ Stream user events in Solr/Fusion cluster
○ Spark Jobs to aggregate click data
○ Use output from the aggregated signal to boost the search results
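The aggregate-then-boost loop can be sketched in a few lines. This is a simplified, in-memory stand-in for the Spark aggregation and Solr/Fusion boosting named above; the function names, the event shape, and the multiplicative boost formula are assumptions for illustration, not the production logic.

```python
from collections import defaultdict


def aggregate_clicks(events):
    """Count clicks per (query, doc_id) from an iterable of (query, doc_id) events."""
    weights = defaultdict(lambda: defaultdict(int))
    for query, doc_id in events:
        weights[query.strip().lower()][doc_id] += 1
    return weights


def rerank(query, results, weights, boost=1.0):
    """Reorder (doc_id, base_score) pairs, boosting docs by their share of clicks."""
    clicks = weights.get(query.strip().lower(), {})
    total = sum(clicks.values())

    def boosted(item):
        doc_id, base_score = item
        share = clicks.get(doc_id, 0) / total if total else 0.0
        return base_score * (1.0 + boost * share)

    return sorted(results, key=boosted, reverse=True)


# Toy click log: "b" is clicked three times for "cats", "a" once.
events = [("cats", "a"), ("cats", "b"), ("cats", "b"), ("Cats ", "b")]
weights = aggregate_clicks(events)
reranked = rerank("cats", [("a", 1.0), ("b", 0.8)], weights)
```

Queries with no recorded clicks fall back to the base relevance order, so the boost only affects queries that have accumulated signal.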
Result: Post search relevance using signals
7.5% increase in CTR
12.5% increase in MRR
Result: Subreddit search relevance using signals
Head-Tail Analysis
● Spelling corrections.
● Tail Query Rewriting.
● Specific Dictionary based Rewriting
Head-Tail Analysis
A tail query like “lot of credit card debit” would be rewritten to produce more relevant results.
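A minimal sketch of how spelling correction and dictionary-based rewriting could be combined for tail queries. The vocabulary of head-query terms, the rewrite dictionary, and the similarity cutoff are all hypothetical values chosen for this example; the slides do not show the actual rewrite rules or their output.

```python
import difflib

# Hypothetical vocabulary of terms that appear in frequent ("head") queries.
HEAD_TERMS = ["credit", "card", "debt", "loan"]

# Hypothetical hand-maintained rewrite dictionary.
REWRITES = {"cc": "credit card"}


def rewrite_tail_query(query, head_terms=HEAD_TERMS, rewrites=REWRITES, cutoff=0.8):
    """Rewrite each token via the dictionary, else snap it to a close head term."""
    rewritten = []
    for token in query.lower().split():
        if token in rewrites:
            rewritten.append(rewrites[token])
        elif token in head_terms:
            rewritten.append(token)
        else:
            # Spelling correction: closest known head term above the cutoff, if any.
            match = difflib.get_close_matches(token, head_terms, n=1, cutoff=cutoff)
            rewritten.append(match[0] if match else token)
    return " ".join(rewritten)


fixed = rewrite_tail_query("cc debtt")
```

Tokens with no close match are left untouched, so rare but valid terms are not rewritten away.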
Trending Searches
● Reddit can attribute week-over-week DAU
growth to external events, like game
releases, movie releases, and cultural
events (reference).
● We see similar upticks in searches based on
these events (reference).
● We believe that we can increase search
engagement and time on site by leveraging
these signals to highlight trending queries
to users when they search on Reddit.
NSFW Categorization
● Develop NSFW classification criteria
● Query-time, classification-based content filtering
● Boost or reorder results based on classification (boost or filter results
depending on whether the query has NSFW intent)
● Look at the NSFW results in recall
● Look at the NSFW results people clicked
● Try open-source TensorFlow libraries to automatically detect NSFW content that
is not marked NSFW
Related Searches
● Train a collaborative filtering matrix decomposition recommender using
SparkML's Alternating Least Squares (ALS) to batch compute query-query
similarities
● Related Searches backend based on Collaborative Filtering & Co Occurrence
Counting Algorithm via Temporal Proximity
● Collaborative filtering based recommender systems are a popular technique
applied for movie recommendations at Netflix, or product recommendations in
e-commerce sites like Amazon
Related Searches
● Dynamic temporal buckets as source of data.
● All pairs irrespective of number of distinct queries in Session
● Length & temporal distance metrics to help with boosting recommendation.
● Intuitive & easily explainable.
● Scales extremely well for building pluggable logic & adding more dimensions.
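The co-occurrence side of this approach can be sketched as follows: every pair of distinct queries in a session is counted, and pairs issued closer together in time get more weight. The session format, the five-minute window, and the decay formula are assumptions for illustration; the ALS-based collaborative filtering model is not shown here.

```python
from collections import defaultdict
from itertools import combinations


def related_searches(sessions, max_gap_seconds=300, top_k=5):
    """Score query pairs by temporal proximity within sessions.

    sessions: list of sessions, each a list of (timestamp_seconds, query).
    Returns {query: [related queries, best first]}.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for session in sessions:
        # All pairs in the session, regardless of how many queries it contains.
        for (t1, q1), (t2, q2) in combinations(sorted(session), 2):
            gap = abs(t2 - t1)
            if q1 == q2 or gap > max_gap_seconds:
                continue
            weight = 1.0 / (1.0 + gap / 60.0)  # closer in time means a stronger link
            scores[q1][q2] += weight
            scores[q2][q1] += weight
    return {
        query: [q for q, _ in sorted(rel.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]]
        for query, rel in scores.items()
    }


# Toy sessions, invented for this example.
sessions = [
    [(0, "learn"), (30, "learn python"), (1000, "cats")],
    [(0, "learn"), (10, "learn python"), (200, "learn java")],
]
related = related_searches(sessions)
```

Because the score is a plain sum of per-pair weights, it is easy to explain why a suggestion appeared and to add further dimensions to the weight later.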
Related Searches
*Query* -> *Related Searches*
*learn* -> `learn programming`, `learn python`, `learn javascript`, `learn French`, `learn java`, `piano learn`
*cats* -> `cat`, `aww`, `dogs`, `r/cats`, `r/comics`, `kittens`, `funny`, `pets`
*dogs* -> `dog`, `dogs`, `aww`, `isle of dogs`, `isle of dogs discussion`, `cute dogs`, `pets`
*infinity war* -> `avengers`, `piracy`, `infinity war stream`, `infinity war hd`, `avengers infinity war`, `avengers infinity war stream`, `deadpool2`, `infinity war torrent`
*coming out* -> `gay`, `lgbt`
*makeup* -> `beauty`, `make up`, `makeupaddiction`, `skincare`, `foundation`, `eyeshadow`, `wedding`
*keto* -> `snacks`, `r/keto`, `r/progresspics`, `xxketo`, `keto recipes`, `keto diet`, `fasting`
*programming* -> `r/politics`, `r/programming`, `programming`, `python`, `coding`, `learnprogramming`, `r/golang`, `r/programming`
*Cohen* -> `sacha baron cohen`, `sasha baron cohen`, `who is america`, `trump`, `jason spencer`, `sacha cohen`, `sasha cohen`
*photography* -> `photo`, `r/Nikon`, `r/photography`, `camera`, `photos`, `art`, `r/bestof`, `instagram`
*blep* -> `mlem`
Future Relevance Work
What’s next
● Contextual Query Understanding
○ how context informs query understanding
● Understanding User Intent
○ classifying the query by its interpretation. The interpretation of the query can then be used to
define intent
● Query rewriting and scoping
○ query rewriting technique that improves precision by matching each query segment to the right
attribute
○ query tagging (special case of named-entity recognition (NER))
Infrastructure and Scaling
Reddit Search has an
interesting history...
History of Reddit Search
● 2005 - Steve Huffman, cofounder and now CEO, implements postgres tsearch.
● 2006 - Chris Slowe, founding engineer and now CTO, implements pylucene.
○ “we fixed a bug in the search results ordering” - Steve Huffman ‘06
○ “I made a quick fix to search that I hope helps until we get a chance to really fix it.” - Steve ‘07
● 2008 - David King, first employee and former search engineer, implements Solr.
○ “[David]’s been fixing search and hacking mystery projects in Erlang.” - Alexis Ohanian ‘08
○ “I’ve totally replaced the reddit search function.” - David King ‘08
● 2010 - David King replaces Solr with IndexTank.
○ “We launched a new search engine yesterday. Calm down. It’s okay. I know. You’ve been hurt
before.” - David King ‘10
● 2012 - u/kemitche implements CloudSearch after LinkedIn shut down IndexTank
“Q: Where do you see reddit in 10 years? A: Reddit search might work by then.” - Steve AMA ‘16
Redditors told us how
much they loved
Search...
“Reddit Search is great!” - said no redditor ever
“This image should honestly replace the 503 error (all servers busy) page.” - u/seven0feleven
“Ever since they moved away from scotch tape, I've been able to get irrelevant results in record time.” - u/El_Bandito_Blanquito
In 2017, we set out to
rebuild search from the
ground up!
Rebuilding Search
Our First Cluster
● Create an AMI with Solr and Fusion packages installed
● Spin up servers with custom AMI
● SSH into each server
○ Install Fusion and Solr
○ Edit configuration files
○ Increase file descriptor limit
● Configured in AWS US West
Our new cluster was up
and running well! We
immediately started work
on ingesting data and
relevance tuning.
But we ran into a
couple of key issues
when trying to scale
up...
Challenge #1
Issues with Scaling our Solr Cluster
● Adding capacity to our cluster or changing instance types took a lot of
effort
● Adding capacity to our cluster meant that we needed to rebalance our
cluster so that our replicas were equally distributed across machines
○ Solr 7+ introduced some basic autoscaling features but lacked
policies to ensure a cluster was properly balanced
○ Rebalancing process was 100% manual
● Cross-region requests cost unnecessary latency
● As a result, our team was very cautious in scaling our cluster until it
was absolutely needed, to reduce the number of times we scaled up
Terraform and Puppet
everything!
Automate all the things!
Terraform + Puppet
● Together they allow us to programmatically make changes to
infrastructure and server configuration quickly
● We can describe how we want servers to be set up
○ Install Java and Solr
○ Mount drives and add user groups/permissions
○ Set up Solr configuration files
● Modifications to servers and infra are reviewable, and revertible
● Roll out changes across our fleet with ease
● “Can you add more servers Jerry??”
○ No problem! One line code change.
Terraforming Solr
Distributing Replicas in
Solr
Equally Distribute by Availability Zone
subreddits
● shard 1: replica 1 on solr-01 (us-east-1a), replica 2 on solr-02 (us-east-1b), replica 3 on solr-03 (us-east-1c)
● shard 2: replica 1 on solr-01 (us-east-1a), replica 2 on solr-02 (us-east-1b), replica 3 on solr-03 (us-east-1c)
● shard 3: replica 1 on solr-01 (us-east-1a), replica 2 on solr-02 (us-east-1b), replica 3 on solr-03 (us-east-1c)
No More Than 1 Replica From Same Shard
subreddits
● shard 1: replica 1 on solr-01 (us-east-1a), replica 2 on solr-02 (us-east-1b), replica 3 on solr-03 (us-east-1c)
● shard 2: replica 1 on solr-01 (us-east-1a), replica 2 on solr-02 (us-east-1b), replica 3 on solr-03 (us-east-1c)
● shard 3: replica 1 on solr-01 (us-east-1a), replica 2 on solr-02 (us-east-1b), replica 3 on solr-03 (us-east-1c)
Equally Distribute Collection’s Replicas
cluster
● solr-01 (us-east-1a): subreddits shard 1 replica 1; posts shard 1 replica 1; posts shard 2 replica 1
● solr-04 (us-east-1a): posts shard 3 replica 1
● solr-02 (us-east-1b): subreddits shard 1 replica 2; posts shard 1 replica 2; posts shard 2 replica 2
● solr-05 (us-east-1b): posts shard 3 replica 2
● solr-03 (us-east-1c): subreddits shard 1 replica 3; posts shard 1 replica 3; posts shard 2 replica 3
● solr-06 (us-east-1c): posts shard 3 replica 3
subreddits - 1 shard; posts - 3 shards; each shard has 3 replicas
Equally Distribute Cluster’s Replicas
cluster
● solr-01 (us-east-1a): subreddits shard 1 replica 1; posts shard 1 replica 1
● solr-04 (us-east-1a): posts shard 3 replica 1; posts shard 2 replica 1
● solr-02 (us-east-1b): subreddits shard 1 replica 2; posts shard 1 replica 2
● solr-05 (us-east-1b): posts shard 3 replica 2; posts shard 2 replica 2
● solr-03 (us-east-1c): subreddits shard 1 replica 3; posts shard 1 replica 3
● solr-06 (us-east-1c): posts shard 3 replica 3; posts shard 2 replica 3
subreddits - 1 shard; posts - 3 shards; each shard has 3 replicas
Solr Rebalancing Tool
● Applied balancing rules in order
○ Check each shard’s availability zone distribution and replica
distribution
○ Move replicas so that each collection’s replicas are spread across as
many machines as possible
○ Move replicas so that each machine holds as few replicas as
possible
● Outputs list of operations to be performed and confirms with user each
replica to move
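The rules above can be approximated with a small planner. This is a sketch under stated assumptions, not the actual tool: the `(collection, shard, replica, node)` tuple layout, the function names, and the greedy strategy of only moving replicas between nodes in the same availability zone (so the zone distribution never changes) are all choices made for this example.

```python
from collections import Counter, defaultdict


def find_violations(placements, node_zone):
    """List shard-level rule violations for (collection, shard, replica, node) tuples."""
    zones = sorted(set(node_zone.values()))
    by_shard = defaultdict(list)
    for collection, shard, _replica, node in placements:
        by_shard[(collection, shard)].append(node)
    violations = []
    for (collection, shard), nodes in sorted(by_shard.items()):
        # Rule: no more than one replica of the same shard on a node.
        for node, count in sorted(Counter(nodes).items()):
            if count > 1:
                violations.append(f"{collection}/{shard}: {count} replicas on {node}")
        # Rule: a shard's replicas are spread evenly across availability zones.
        per_zone = Counter(node_zone[node] for node in nodes)
        counts = [per_zone[zone] for zone in zones]
        if max(counts) - min(counts) > 1:
            violations.append(f"{collection}/{shard}: uneven zone spread {dict(per_zone)}")
    return violations


def plan_moves(placements, node_zone):
    """Greedily move replicas from the busiest to the least busy node in each zone."""
    placements = list(placements)
    moves = []
    while True:
        load = Counter({node: 0 for node in node_zone})
        for _c, _s, _r, node in placements:
            load[node] += 1
        move = None
        for zone in sorted(set(node_zone.values())):
            nodes = sorted((n for n in node_zone if node_zone[n] == zone),
                           key=lambda n: (load[n], n))
            light, heavy = nodes[0], nodes[-1]
            if load[heavy] - load[light] < 2:
                continue  # this zone is already as even as it can get
            on_light = {(c, s) for c, s, _r, n in placements if n == light}
            for i, (c, s, r, n) in enumerate(placements):
                # Never put two replicas of the same shard on one node.
                if n == heavy and (c, s) not in on_light:
                    move = (i, c, s, r, heavy, light)
                    break
            if move:
                break
        if move is None:
            return moves, placements
        i, c, s, r, src, dst = move
        placements[i] = (c, s, r, dst)
        moves.append((c, s, r, src, dst))


# Unbalanced layout: three "big" nodes hold 3 replicas each, three "small" ones hold 1.
node_zone = {
    "solr-01": "us-east-1a", "solr-04": "us-east-1a",
    "solr-02": "us-east-1b", "solr-05": "us-east-1b",
    "solr-03": "us-east-1c", "solr-06": "us-east-1c",
}
placements = []
pairs = [("solr-01", "solr-04"), ("solr-02", "solr-05"), ("solr-03", "solr-06")]
for replica, (big, small) in enumerate(pairs, start=1):
    placements += [
        ("subreddits", "shard1", replica, big),
        ("posts", "shard1", replica, big),
        ("posts", "shard2", replica, big),
        ("posts", "shard3", replica, small),
    ]
moves, balanced = plan_moves(placements, node_zone)
```

Returning the list of moves instead of applying them mirrors the confirm-before-moving behaviour described above: an operator can review each proposed move before anything changes in the cluster.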
Search Architecture Today
Cross-Region Latency Improvement
4x faster
queries!
Our cluster was now
scaling easily, but
reindexing all of our
data took many
weeks...
Challenge #2
Indexing Data for Search
● Backfills
○ Pulls data from our datasource
○ Transforms it into the schema we need for indexing
○ Used to add/remove/change field indexing
● Streaming
○ Captures real-time updates so up-to-date information can be
reflected in our indices
○ Transforms data the same way as backfills
Why are fast backfills important?
● Quickly iterate on document schemas
● Test new ways to analyze document fields
● Create multiple clusters of the same data for testing
● Fix data issues rapidly
Thing Data Model
Hive
● Pulled data from postgres with sqoop into Hive
● A series of transformations to
○ Join thing and data tables
○ Rotate the keys into columns
○ Store the final result as Parquet in S3
● Fusion/Spark fetched S3 files and indexed data into Solr
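To make the "join, then rotate the keys into columns" step concrete, here is a small in-memory sketch. It assumes a thing table with one row per thing and a data table of `(thing_id, key, value)` rows; the column names and sample values are hypothetical, and the real pipeline ran as Hive transformations rather than Python.

```python
def pivot_things(thing_rows, data_rows):
    """Left-join key/value data rows onto thing rows, one flat document per thing.

    thing_rows: iterable of dicts that each contain a 'thing_id'.
    data_rows: iterable of (thing_id, key, value) tuples.
    """
    docs = {row["thing_id"]: dict(row) for row in thing_rows}
    for thing_id, key, value in data_rows:
        if thing_id in docs:  # ignore data rows whose thing is missing
            docs[thing_id][key] = value  # each key becomes its own column
    return list(docs.values())


# Hypothetical sample rows.
things = [{"thing_id": 1, "ups": 10}, {"thing_id": 2, "ups": 3}]
data = [
    (1, "title", "Cat Fist Bumping"),
    (1, "over_18", "f"),
    (3, "title", "orphaned row"),
]
documents = pivot_things(things, data)
```

Each resulting dict is one flat record, which is the shape that can be written out as a Parquet row and later indexed as a Solr document.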
Issues with v1
● Several weeks to transform data
○ Afraid of changing the schema
● Many stages of transformation made it hard to debug and to tell how far
upstream a data transformation issue originated
○ Hard to ensure the end result was correct
Thing Service
● Search Service as the transformer and indexer of data
○ Fetches the latest data from the Thing Service
● Special logic in Thing Service made it easier to handle postgres data
○ Score of links, comments
○ Converting to actual data types (booleans, fullnames)
● Cut backfill time from multiple weeks to a single week with
parallelization
Issues with v2
● Reliant upon a shared production service for what should be an offline
job
○ We’ve pushed the thing service too hard with our backfills,
affecting other services that rely upon it
● Other initiatives highlighted how slow our ingestion could get
○ HVTs (augmenting links with high value tokens from comments)
○ Attempts to index comment data
Spark
● Running our own postgres replicas from wal-e backups in S3
● Spark pulls data directly from postgres and transforms the data
● Can horizontally scale ingestion to be faster
○ Postgres to speed up ingestion of data into Spark
○ Spark to speed up transformation and joining of data
● We can adjust ingestion parallelism by repartitioning at the end
● Cut backfill time significantly from multiple weeks to days
Random 100% CPU
spikes prevented us
from shipping new
search features...
Challenge #3
Redditors Issue Expensive Queries
● High Recall Queries
○ the, would, you, ifs, news, games
● Crazy Queries
○ (AFD+OR+CDU+OR+CSU+OR+FDP+OR+Grünen+OR+SPD+OR+"Die+Linke"+OR+Energiepolitik+OR+Gesetze~+OR+Kabinetts~+OR+Regierungs~+OR+Referentenentwurf)+(Energiehandel~+OR+Energiemanagement~+OR+Energiepreis~+OR+Energiesteuer~)
● These queries would take multiple seconds to complete, blocking a
significant number of CPU cores in the cluster
Cutting Queries Off
● Utilize timeAllowed in solrconfig.xml to prevent expensive queries
taking up all of your cluster’s resources
○ NOTE: timeAllowed is not a hard cutoff. From the Solr docs:
○ As this check is periodically performed, the actual time for which a
request can be processed before it is aborted would be marginally
greater than or equal to the value of timeAllowed. If the request
consumes more time in other stages, e.g., custom components,
etc., this parameter is not expected to abort the request.
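Besides a default in solrconfig.xml, `timeAllowed` can be sent as a request parameter (in milliseconds). The sketch below builds such a request URL and checks the `partialResults` flag that Solr sets in the response header when the limit was hit. The host, collection name, 500 ms budget, and helper names are assumptions for this example.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, time_allowed_ms=500, rows=10):
    """Build a Solr /select URL that caps search time with timeAllowed."""
    params = {
        "q": query,
        "rows": rows,
        "timeAllowed": time_allowed_ms,  # milliseconds; a soft limit, not a hard cutoff
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


def is_partial(response):
    """True when the parsed JSON response reports that results were cut short."""
    return bool(response.get("responseHeader", {}).get("partialResults", False))


# Hypothetical local Solr endpoint and collection name.
url = build_select_url("http://localhost:8983/solr/", "posts", "the OR would")
```

Checking `is_partial` lets the caller decide whether to show the truncated results, retry with a simpler query, or fall back to an error state.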
Future Scalability Work
Multi-Cluster Solr Environment
● One cluster per collection
● Hardware Isolation: one collection’s issues won’t affect other
collections
● Scale each collection independently
● Balancing becomes really simple
○ Each machine holds an equal number of replicas
○ Ensure AZ and shard awareness
Solr 7.5 Autoscaling
● Solr 7.5 includes new policies that allow us to equally distribute
replicas by
○ Arbitrary properties
○ Collection
○ Cluster
● Turn Solr Scaling into a one step process
Questions?
Thank you!
Anupama Joshi
anupama@reddit.com
linkedin.com/in/anupamajoshi
Jerry Bao
jerry.bao@reddit.com
linkedin.com/in/thejerrybao
PS: We’re Hiring!
reddit.com/jobs

The right path to making search relevant - Taxonomy Bootcamp London 2019
 
STQA-Vol9-Issue2-March-2012-Software-Testing-Magazine
STQA-Vol9-Issue2-March-2012-Software-Testing-MagazineSTQA-Vol9-Issue2-March-2012-Software-Testing-Magazine
STQA-Vol9-Issue2-March-2012-Software-Testing-Magazine
 
How Google works
How Google worksHow Google works
How Google works
 
Schema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & HowSchema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & How
 

More from Lucidworks

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategyLucidworks
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceLucidworks
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsLucidworks
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesLucidworks
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Lucidworks
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...Lucidworks
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Lucidworks
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Lucidworks
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentLucidworks
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeLucidworks
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Lucidworks
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchLucidworks
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Lucidworks
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyLucidworks
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Lucidworks
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceLucidworks
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchLucidworks
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondLucidworks
 

More from Lucidworks (20)

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce Strategy
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in Salesforce
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant Products
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized Experiences
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and Rosette
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - Europe
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 Research
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise Search
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and Beyond
 

Recently uploaded

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 

Recently uploaded (20)

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Making Reddit Search Relevant and Scalable - Anupama Joshi & Jerry Bao, Reddit

  • 1. Making Reddit Search Relevant and Scalable Anupama Joshi Senior Engineering Manager, Search Jerry Bao Senior Software Engineer, Search
  • 2. Agenda • What is Reddit? • Search Architecture • Improving our Relevance • The History of Search @ Reddit • Scaling our Infrastructure • Q&A
  • 3. What is Reddit? Reddit is a network of communities where individuals can find experiences built around their interests, hobbies and passions It’s where people converse about the things that are most important to them
  • 4. Bring community and belonging to everyone Our mission
  • 5. Reddit by the numbers Alexa Rank (US/World) MAU Communities Posts per day Comments per day Votes per day Searches per day 5th/18th 400M+ 1M+ 440K+ 3.5M+ 82M+ 68M+
  • 6. So, what are we doing with all that power?
  • 7. Dog getting love 51.2k points (95% upvoted) Cat Fist Bumping 137.1k points (90% upvoted) 817.2k views
  • 8. Wait, it’s not just cat/dog pictures!
  • 9. Community > Content > Individual ● Authenticity ● Creative freedom ● Empathy @ scale ● Belonging ● Being heard
  • 13. None of that matters if you can’t FIND the content! So let’s talk about Search...
  • 14. User Retention ● New users who searched are 300% more likely to come back between D1 and D14 ● > 50% of all mobile users search
  • 18. Show and Tell: A better subreddit search Challenge: Redditors are very creative in their subreddit naming (e.g. r/superbowl is about superb owl pictures) which whilst fun, poses a challenge for discovery. Answer: faceted search on posts!
  • 19. Result: A better subreddit search
  • 20. Result: A better subreddit search
  • 21. Show and Tell: Better Post Search ● Post search with phrase matching of selftext The challenge: What about images and link posts? Answer - Comments ● Comments are important but which comments are most relevant to the post? ● How do we separate the signal from the noise? Answer - HVT ● HVTs are the highest scoring tf-idf terms from comment sections. ● Index and match on these HVTs along with post selftexts and titles.
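The HVT idea on slide 21 can be sketched as follows. This is a minimal illustration rather than Reddit's pipeline: the tokenizer, the choice to treat each post's whole comment section as one document, and the names `high_value_terms` and `top_k` are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letters (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def high_value_terms(comments_by_post, top_k=3):
    """Return the top_k highest tf-idf terms for each post's comment section.

    Each post's comments are pooled into one document; idf is computed across
    posts, so terms that appear in every thread score zero.
    """
    term_counts = {
        post_id: Counter(tok for comment in comments for tok in tokenize(comment))
        for post_id, comments in comments_by_post.items()
    }
    n_docs = len(term_counts)

    # Document frequency: in how many comment sections does each term occur?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    hvts = {}
    for post_id, counts in term_counts.items():
        total = sum(counts.values()) or 1
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        hvts[post_id] = [term for term, _ in ranked[:top_k]]
    return hvts
```

The resulting terms would then be indexed as an extra field next to the post title and selftext, so that a query can match a post whose own text never mentions the term.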
  • 22. Result: Better Post Search Qualitatively, we saw some users notice almost immediately when we first introduced HVTs. For some queries, the difference is quite stark. The following are search results for the query ‘shabooya’. Note how ‘shabooya’ doesn’t appear anywhere in the title or the body of the first three post results, but you can see the phrase show up in the comments.
  • 23. Result: Better Post Search ● Post click-through rate (CTR): +3.15% ● Relevancy ranking for navigational searches (MRR): +4.01% ● Search experience improvements for navigational searches due to increased recall on posts with poor title or body text
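For readers unfamiliar with the two metrics on slide 23, here is a small sketch of how CTR and mean reciprocal rank (MRR) are commonly computed from click logs. The slides do not say how Reddit measures them, so the input format (one 1-based rank of the first clicked result per search, `None` for no click) is an assumption.

```python
def click_through_rate(first_click_ranks):
    """Fraction of searches that received at least one click."""
    if not first_click_ranks:
        return 0.0
    clicked = sum(1 for rank in first_click_ranks if rank)
    return clicked / len(first_click_ranks)


def mean_reciprocal_rank(first_click_ranks):
    """Average of 1/rank of the first clicked result; no click contributes 0."""
    if not first_click_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_click_ranks if rank)
    return total / len(first_click_ranks)
```

MRR rewards putting the clicked result near the top: a click at rank 1 contributes 1.0, while a click at rank 4 contributes only 0.25.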
  • 24. Take It to the Next Level: Improve Search Relevance ● Learn from users' click statistics to automatically generate a relevancy model ● Rerank search results using aggregated click-signal weights, so that results users click more often for a given query rank higher ○ Stream user events into the Solr/Fusion cluster ○ Spark jobs to aggregate click data ○ Use output from the aggregated signal to boost the search results
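The signal-boosting loop on slide 24 can be sketched in plain Python. In the talk the aggregation runs as Spark jobs and the boost is applied inside Solr/Fusion; the functions `aggregate_clicks` and `rerank`, the additive boost formula, and the `boost` parameter below are stand-ins assumed for illustration.

```python
from collections import defaultdict


def aggregate_clicks(click_events):
    """Aggregate (query, doc_id) click events into {query: {doc_id: clicks}}."""
    weights = defaultdict(lambda: defaultdict(int))
    for query, doc_id in click_events:
        weights[query.strip().lower()][doc_id] += 1
    return weights


def rerank(query, results, weights, boost=1.0):
    """Reorder (doc_id, base_score) results using aggregated click weights.

    Each document's score is raised by `boost` times its share of all clicks
    recorded for this query. Queries with no signals keep their base order.
    """
    clicks = weights.get(query.strip().lower(), {})
    total = sum(clicks.values())

    def boosted(item):
        doc_id, base_score = item
        share = clicks.get(doc_id, 0) / total if total else 0.0
        return base_score + boost * share

    return [doc_id for doc_id, _ in sorted(results, key=boosted, reverse=True)]
```

Normalising by the query's total clicks keeps the boost comparable between head queries with many clicks and tail queries with few.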
  • 25. Result: Post search relevance using signals ● 7.5% increase in CTR ● 12.5% increase in MRR
  • 26. Result: Subreddit search relevance using signals
  • 27. Head-Tail Analysis ● Spelling corrections. ● Tail Query Rewriting. ● Specific Dictionary based Rewriting
  • 28. Head-Tail Analysis A tail query like “lot of credit card debit” would be rewritten to produce more relevant results.
  • 29. Trending Searches ● Reddit can attribute week-over-week DAU growth to external events, like game releases, movie releases, and cultural events (reference). ● We see similar upticks in searches based on these events (reference). ● We believe that we can increase search engagement and time on site by leveraging these signals to highlight trending queries to users when they search on Reddit.
  • 30. NSFW Categorization ● Develop NSFW classification criteria ● Query-time, classification-based content filtering ● Results boosting/reordering based on classification (boost or filter results based on knowing the query does/does not have NSFW intent) ● Look at the NSFW results in recall ● Look at the NSFW results people clicked ● Try open source TensorFlow libraries for auto-detection of NSFW content that is not marked NSFW
  • 31. Related Searches ● Train a collaborative filtering matrix decomposition recommender using SparkML's Alternating Least Squares (ALS) to batch compute query-query similarities ● Related Searches backend based on Collaborative Filtering & Co Occurrence Counting Algorithm via Temporal Proximity ● Collaborative filtering based recommender systems are a popular technique applied for movie recommendations at Netflix, or product recommendations in e-commerce sites like Amazon
  • 32. Related Searches ● Dynamic temporal buckets as source of data. ● All pairs irrespective of number of distinct queries in Session ● Length & temporal distance metrics to help with boosting recommendation. ● Intuitive & easily explainable. ● Scales extremely well for building pluggable logic & adding more dimensions.
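The co-occurrence half of the related-searches backend (slides 31 and 32) might look roughly like this. It is a sketch under stated assumptions: a fixed `window_seconds` stands in for the dynamic temporal buckets, the linear time-decay weight is invented for the example, and the ALS recommender mentioned on slide 31 is not shown.

```python
from collections import defaultdict
from itertools import combinations


def related_searches(search_log, window_seconds=300, top_k=3):
    """Score query pairs issued by the same user close together in time.

    search_log: iterable of (user_id, timestamp_seconds, query).
    Returns {query: [related queries, best first]}.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, query in search_log:
        by_user[user_id].append((timestamp, query.strip().lower()))

    scores = defaultdict(lambda: defaultdict(float))
    for events in by_user.values():
        events.sort()
        # All pairs are considered, regardless of how many distinct queries
        # the user issued.
        for (t1, q1), (t2, q2) in combinations(events, 2):
            gap = t2 - t1
            if q1 == q2 or gap >= window_seconds:
                continue
            weight = 1.0 - gap / window_seconds  # closer in time scores higher
            scores[q1][q2] += weight
            scores[q2][q1] += weight

    return {
        query: [
            other
            for other, _ in sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        ]
        for query, related in scores.items()
    }
```

Because the score is a plain sum of per-pair weights, it is easy to explain why one query was suggested for another, and further signals such as query length could be folded into `weight`.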
  • 33. Related Searches (*Query* —> *Related Searches*) ● *learn* —> `learn programming`, `learn python`, `learn javascript`, `learn French`, `learn java`, `piano learn` ● *cats* —> `cat`, `aww`, `dogs`, `r/cats`, `r/comics`, `kittens`, `funny`, `pets` ● *dogs* —> `dog`, `dogs`, `aww`, `isle of dogs`, `isle of dogs discussion`, `cute dogs`, `pets` ● *infinity war* —> `avengers`, `piracy`, `infinity war stream`, `infinity war hd`, `avengers infinity war`, `avengers infinity war stream`, `deadpool2`, `infinity war torrent` ● *coming out* —> `gay`, `lgbt` ● *makeup* —> `beauty`, `make up`, `makeupaddiction`, `skincare`, `foundation`, `eyeshadow`, `wedding` ● *keto* —> `snacks`, `r/keto`, `r/progresspics`, `xxketo`, `keto recipes`, `keto diet`, `fasting` ● *programming* —> `r/politics`, `r/programming`, `programming`, `python`, `coding`, `learnprogramming`, `r/golang`, `r/programming` ● *Cohen* —> `sacha baron cohen`, `sasha baron cohen`, `who is america`, `trump`, `jason spencer`, `sacha cohen`, `sasha cohen` ● *photography* —> `photo`, `r/Nikon`, `r/photography`, `camera`, `photos`, `art`, `r/bestof`, `instagram` ● *blep* —> `mlem`
  • 35. What’s next ● Contextual Query Understanding ○ how context informs query understanding ● Understanding User Intent ○ classifying the query by its interpretation. The interpretation of the query can then be used to define intent ● Query rewriting and scoping ○ query rewriting technique that improves precision by matching each query segment to the right attribute ○ query tagging (special case of named-entity recognition (NER))
  • 37. Reddit Search has an interesting history... History of Reddit Search
  • 38. History of Reddit Search ● 2005 - Steve Huffman, cofounder and now CEO, implements postgres tsearch. ● 2006 - Chris Slowe, founding engineer and now CTO, implements pylucene. ○ “we fixed a bug in the search results ordering” - Steve Huffman ‘06 ○ “I made a quick fix to search that I hope helps until we get a chance to really fix it.” - Steve ‘07 ● 2008 - David King, first employee and former search engineer, implements Solr. ○ “[David]’s been fixing search and hacking mystery projects in Erlang.” - Alexis Ohanian ‘08 ○ “I’ve totally replaced the reddit search function.” - David King ‘08 ● 2010 - David King replaces Solr with IndexTank. ○ “We launched a new search engine yesterday. Calm down. It’s okay. I know. You’ve been hurt before.” - David King ‘10 ● 2012 - u/kemitche implements CloudSearch after LinkedIn shut down IndexTank “Q: Where do you see reddit in 10 years? A: Reddit search might work by then.” - Steve AMA ‘16
  • 39. Redditors told us how much they loved Search... “Reddit Search is great!” - said no redditor ever
  • 40. “This image should honestly replace the 503 error (all servers busy) page.” - u/seven0feleven
  • 41. “Ever since they moved away from scotch tape, I've been able to get irrelevant results in record time.” - u/El_Bandito_Blanquito
  • 42. In 2017, we set out to rebuild search from the ground up! Rebuilding Search
  • 43. Our First Cluster ● Create an AMI with Solr and Fusion packages installed ● Spin up servers with custom AMI ● SSH into each server ○ Install Fusion and Solr ○ Edit configuration files ○ Increase file descriptor limit ● Configured in AWS US West
  • 44. Our First Cluster Our new cluster was up and running well! We immediately started work on ingesting data and relevance tuning.
  • 45. But we ran into a couple of key issues when trying to scale up... Challenge #1
  • 46. Issues with Scaling our Solr Cluster ● Adding capacity to our cluster or changing instance types took a lot of effort ● Adding capacity to our cluster meant that we needed to rebalance it so that our replicas were equally distributed across machines ○ Solr 7+ introduced some basic autoscaling features but lacked policies to ensure a cluster was properly balanced ○ Rebalancing process was 100% manual ● Cross-region requests added unnecessary latency ● As a result, our team was very cautious about scaling our cluster and did so only when absolutely needed, to reduce the number of times we scaled up
  • 48. Terraform + Puppet ● Together they allow us to programmatically make changes to infrastructure and server configuration quickly ● We can describe how we want servers to be set up ○ Install Java and Solr ○ Mount drives and add user groups/permissions ○ Set up Solr configuration files ● Modifications to servers and infra are reviewable and revertible ● Roll out changes across our fleet with ease ● “Can you add more servers Jerry??” ○ No problem! One line code change.
  • 55. Equally Distribute by Availability Zone subreddits shard 1 replica 1 solr-01 us-east-1a replica 2 solr-02 us-east-1b replica 3 solr-03 us-east-1c shard 2 replica 1 solr-01 us-east-1a replica 2 solr-02 us-east-1b replica 3 solr-03 us-east-1c shard 3 replica 1 solr-01 us-east-1a replica 2 solr-02 us-east-1b replica 3 solr-03 us-east-1c
  • 56. No More Than 1 Replica From Same Shard subreddits shard 1 replica 1 solr-01 us-east-1a replica 2 solr-02 us-east-1b replica 3 solr-03 us-east-1c shard 2 replica 1 solr-01 us-east-1a replica 2 solr-02 us-east-1b replica 3 solr-03 us-east-1c shard 3 replica 1 solr-01 us-east-1a replica 2 solr-02 us-east-1b replica 3 solr-03 us-east-1c
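The two shard-level rules on slides 55 and 56 can be checked mechanically. The following is a minimal sketch, not the tool described in the talk: the `Replica` record, the `placement_violations` function, and the "counts differ by at most one" reading of "equally distribute" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Replica(NamedTuple):
    # Hypothetical record describing where one replica lives.
    collection: str
    shard: int
    node: str
    zone: str


ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]


def placement_violations(replicas, zones):
    """Return a list of messages for shards that break either rule."""
    by_shard = defaultdict(list)
    for r in replicas:
        by_shard[(r.collection, r.shard)].append(r)

    problems = []
    for (collection, shard), members in sorted(by_shard.items()):
        # Rule 1: a shard's replicas are spread evenly across availability zones
        # (assumed here to mean per-zone counts differ by at most one).
        zone_counts = Counter(r.zone for r in members)
        counts = [zone_counts.get(z, 0) for z in zones]
        if max(counts) - min(counts) > 1:
            problems.append(f"{collection} shard {shard}: uneven across zones")

        # Rule 2: no node holds more than one replica of the same shard.
        node_counts = Counter(r.node for r in members)
        for node, n in sorted(node_counts.items()):
            if n > 1:
                problems.append(f"{collection} shard {shard}: {n} replicas on {node}")
    return problems


# The layout from the slides: three shards, one replica per node and zone.
good = [
    Replica("subreddits", shard, node, zone)
    for shard in (1, 2, 3)
    for node, zone in zip(("solr-01", "solr-02", "solr-03"), ZONES)
]

# A broken layout: shard 1's second replica is also placed on solr-01.
bad = list(good)
bad[1] = Replica("subreddits", 1, "solr-01", "us-east-1a")
```

Calling `placement_violations(good, ZONES)` returns an empty list, while the `bad` layout reports both an uneven zone spread and a doubled-up node for shard 1.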
  • 57. Equally Distribute Collection’s Replicas cluster solr-01 (us-east-1a) subreddits shard 1 replica 1 posts shard 1 replica 1 posts shard 2 replica 1 solr-04 (us-east-1a) posts shard 3 replica 1 solr-02 (us-east-1b) subreddits shard 1 replica 2 posts shard 1 replica 2 posts shard 2 replica 2 solr-03 (us-east-1c) subreddits shard 1 replica 3 posts shard 1 replica 3 posts shard 2 replica 3 solr-05 (us-east-1b) posts shard 3 replica 2 solr-06 (us-east-1c) posts shard 3 replica 3 subreddits - 1 shard; posts - 3 shards; each shard has 3 replicas
  • 58. Equally Distribute Cluster’s Replicas cluster solr-01 (us-east-1a) subreddits shard 1 replica 1 posts shard 1 replica 1 solr-04 (us-east-1a) posts shard 3 replica 1 posts shard 2 replica 1 solr-02 (us-east-1b) subreddits shard 1 replica 2 posts shard 1 replica 2 solr-03 (us-east-1c) subreddits shard 1 replica 3 posts shard 1 replica 3 solr-05 (us-east-1b) posts shard 3 replica 2 posts shard 2 replica 2 solr-06 (us-east-1c) posts shard 3 replica 3 posts shard 2 replica 3 subreddits - 1 shard; posts - 3 shards; each shard has 3 replicas
  • 59. Solr Rebalancing Tool ● Applied balancing rules in order ○ Check each shard’s availability zone distribution and replica distribution ○ Move replicas so that each collection’s replicas are spread across as many machines as possible ○ Move replicas so that each machine holds as few replicas as possible ● Outputs a list of operations to be performed and confirms each replica move with the user
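A rough sketch of the last two balancing steps, under stated assumptions: the actual tool's logic is not shown in the talk, so the greedy strategy, the `plan_moves` name, and the input shapes below are hypothetical. Moves stay within one availability zone so the per-shard zone spread is preserved, and a replica is never moved onto a node that already holds a replica of the same shard. The function only returns the planned operations; in a real cluster each one would presumably be confirmed by the operator and then carried out with something like Solr's MOVEREPLICA Collections API action.

```python
from collections import Counter, defaultdict


def plan_moves(placement, node_zone):
    """Plan replica moves that even out replica counts per node.

    placement: {node: set of (collection, shard)} (hypothetical input shape)
    node_zone: {node: availability zone}
    Returns a list of ((collection, shard), source_node, target_node).
    """
    placement = {node: set(reps) for node, reps in placement.items()}
    nodes_by_zone = defaultdict(list)
    for node, zone in node_zone.items():
        nodes_by_zone[zone].append(node)
        placement.setdefault(node, set())

    moves = []
    for zone in sorted(nodes_by_zone):
        nodes = sorted(nodes_by_zone[zone])
        while True:
            src = max(nodes, key=lambda n: len(placement[n]))
            dst = min(nodes, key=lambda n: len(placement[n]))
            if len(placement[src]) - len(placement[dst]) <= 1:
                break  # this zone is balanced
            # Never put two replicas of the same shard on one node.
            movable = placement[src] - placement[dst]
            if not movable:
                break
            # Prefer the collection most concentrated on the source node,
            # so each collection ends up on as many machines as possible.
            per_collection = Counter(c for c, _ in placement[src])
            replica = max(movable, key=lambda r: (per_collection[r[0]], r))
            placement[src].remove(replica)
            placement[dst].add(replica)
            moves.append((replica, src, dst))
    return moves


# The unbalanced layout from slide 57.
NODE_ZONE = {
    "solr-01": "us-east-1a", "solr-04": "us-east-1a",
    "solr-02": "us-east-1b", "solr-05": "us-east-1b",
    "solr-03": "us-east-1c", "solr-06": "us-east-1c",
}
SLIDE_57 = {
    "solr-01": {("subreddits", 1), ("posts", 1), ("posts", 2)},
    "solr-02": {("subreddits", 1), ("posts", 1), ("posts", 2)},
    "solr-03": {("subreddits", 1), ("posts", 1), ("posts", 2)},
    "solr-04": {("posts", 3)},
    "solr-05": {("posts", 3)},
    "solr-06": {("posts", 3)},
}

for replica, src, dst in plan_moves(SLIDE_57, NODE_ZONE):
    print(f"move {replica[0]} shard {replica[1]}: {src} -> {dst}")
```

On the slide 57 layout this plans three moves, shifting the posts shard 2 replica from solr-01, solr-02, and solr-03 to solr-04, solr-05, and solr-06 respectively, which matches the balanced layout on slide 58.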
  • 63. Our cluster was now scaling easily, but reindexing all of our data took many weeks... Challenge #2
  • 64. Indexing Data for Search ● Backfills ○ Pulls data from our datasource ○ Transforms it into the schema we need for indexing ○ Used to add/remove/change field indexing ● Streaming ○ Captures real-time updates so up-to-date information can be reflected in our indices ○ Transforms data the same way as backfills
  • 65. Why are fast backfills important? ● Quickly iterate on document schemas ● Test new ways to analyze document fields ● Create multiple clusters of the same data for testing ● Fix data issues rapidly
  • 67. Hive ● Pulled data from postgres with sqoop into Hive ● A series of transformations to ○ Join thing and data tables ○ Rotate the keys into columns ○ Store the final result as Parquet in S3 ● Fusion/Spark fetched S3 files and indexed data into Solr
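The "join thing and data tables, rotate the keys into columns" step is a pivot from key/value rows into one wide record per thing. A minimal in-memory sketch follows, assuming hypothetical column names (`thing_id`, `key`, `value`, `ups`); the original pipeline did this in Hive, not Python.

```python
def pivot_things(thing_rows, data_rows):
    """Join thing rows with their key/value data rows, one dict per thing."""
    docs = {row["thing_id"]: dict(row) for row in thing_rows}
    for row in data_rows:
        doc = docs.get(row["thing_id"])
        if doc is not None:  # skip data rows with no matching thing
            doc[row["key"]] = row["value"]  # the key becomes a column
    return [docs[thing_id] for thing_id in sorted(docs)]


# Hypothetical sample rows for illustration only.
things = [
    {"thing_id": 1, "ups": 10},
    {"thing_id": 2, "ups": 3},
]
data = [
    {"thing_id": 1, "key": "title", "value": "Hello"},
    {"thing_id": 1, "key": "subreddit", "value": "aww"},
    {"thing_id": 2, "key": "title", "value": "World"},
    {"thing_id": 3, "key": "title", "value": "orphan"},
]

documents = pivot_things(things, data)
```

Each resulting dict has the thing's own columns plus one column per data key, which is the flat shape a search index expects.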
  • 69. Issues with v1 ● Several weeks to transform data ○ Afraid of changing the schema ● Many stages of transformation, which made it hard to debug and to work out how far upstream a data transformation issue originated ○ Hard to ensure the end result was correct
  • 70. Thing Service ● Search Service as the transformer and indexer of data ○ Fetches the latest data from the Thing Service ● Special logic in Thing Service made it easier to handle postgres data ○ Score of links, comments ○ Converting to actual data types (booleans, fullnames) ● Cut backfill time from multiple weeks to a single week with parallelization
  • 72. Issues with v2 ● Reliant upon a shared production service for what should be an offline job ○ We pushed the Thing Service too hard with our backfills, affecting other services that rely upon it ● Other initiatives highlighted how slow our ingestion could get ○ HVTs (augmenting links with high value tokens from comments) ○ Attempts to index comment data
  • 73. Spark ● Running our own postgres replicas from wal-e backups in S3 ● Spark pulls data directly from postgres and transforms the data ● Can horizontally scale ingestion to be faster ○ More postgres replicas to speed up ingestion of data into Spark ○ More Spark workers to speed up transformation and joining of data ● We can adjust ingestion parallelism by repartitioning at the end ● Cut backfill time significantly, from multiple weeks to days
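One common way to scale a database read horizontally is to split the primary-key range into contiguous chunks and let each worker read one chunk. The sketch below shows only that splitting step in plain Python; the `id_ranges` helper is hypothetical and is not taken from the talk. Spark's JDBC data source offers partitioned reads along similar lines through its `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` options.

```python
def id_ranges(min_id, max_id, num_partitions):
    """Split [min_id, max_id] into contiguous, non-overlapping inclusive ranges."""
    if max_id < min_id:
        raise ValueError("max_id must be >= min_id")
    total = max_id - min_id + 1
    # Never create more partitions than there are ids.
    num_partitions = max(1, min(num_partitions, total))
    base, extra = divmod(total, num_partitions)

    ranges = []
    start = min_id
    for i in range(num_partitions):
        # The first `extra` partitions take one additional id each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Each (low, high) pair could back one query such as
# "WHERE thing_id BETWEEN low AND high", run by a separate worker.
for low, high in id_ranges(1, 10, 3):
    print(low, high)
```

Raising the number of partitions increases read parallelism, bounded by how many concurrent connections the Postgres replicas can serve.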
  • 74. Random 100% CPU spikes prevented us from shipping new search features... Challenge #3
  • 76. Redditors Issue Expensive Queries ● High Recall Queries ○ the, would, you, ifs, news, games ● Crazy Queries ○ (AFD+OR+CDU+OR+CSU+OR+FDP+OR+Grünen+OR+SPD+OR+"Die+Linke"+OR+Energiepolitik+OR+Gesetze~+OR+Kabinetts~+OR+Regierungs~+OR+Referentenentwurf)+(Energiehandel~+OR+Energiemanagement~+OR+Energiepreis~+OR+Energiesteuer~) ● These queries would take multiple seconds to complete, blocking a significant number of CPU cores in the cluster
  • 77. Cutting Queries Off ● Utilize timeAllowed in solrconfig.xml to prevent expensive queries from taking up all of your cluster’s resources ○ NOTE: timeAllowed is not a hard cutoff. From the Solr docs: “As this check is periodically performed, the actual time for which a request can be processed before it is aborted would be marginally greater than or equal to the value of timeAllowed. If the request consumes more time in other stages, e.g., custom components, etc., this parameter is not expected to abort the request.”
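The slide sets timeAllowed as a default in solrconfig.xml; the same parameter can also be sent per request, in milliseconds. The sketch below builds such a request URL and checks a response for the `partialResults` flag that Solr places in the response header when the time limit cut a query short. The host, the collection name, and the 500 ms budget are assumptions for illustration, and both helper functions are hypothetical.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, time_allowed_ms):
    """Build a /select URL that asks Solr to stop searching after a time budget."""
    params = {"q": query, "timeAllowed": time_allowed_ms, "wt": "json"}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def is_partial(response):
    """True if a parsed Solr JSON response was cut short by timeAllowed."""
    return bool(response.get("responseHeader", {}).get("partialResults", False))


# Assumed local Solr address and collection name.
url = build_select_url("http://localhost:8983/solr", "posts", "the", 500)

# Hand-written sample responses, shaped like Solr's JSON output.
timed_out = {"responseHeader": {"status": 0, "partialResults": True}}
complete = {"responseHeader": {"status": 0}}
```

A caller that sees a partial response can decide whether to show the incomplete results or fall back to a cheaper query.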
  • 80. Multi-Cluster Solr Environment ● One cluster per collection ● Hardware Isolation: one collection’s issues won’t affect other collections ● Scale each collection independently ● Balancing becomes really simple ○ Each machine has an equal number of replicas ○ Ensure AZ and shard awareness
  • 81. Solr 7.5 Autoscaling ● Solr 7.5 includes new policies that allow us to equally distribute replicas by ○ Arbitrary properties ○ Collection ○ Cluster ● Turns Solr scaling into a one-step process
  • 83. Thank you! Anupama Joshi anupama@reddit.com linkedin.com/in/anupamajoshi Jerry Bao jerry.bao@reddit.com linkedin.com/in/thejerrybao PS: We’re Hiring! reddit.com/jobs