How to use open source technologies to build search engines for billions of users, billions of revenue, billions of documents
Keynote talk at The 16th International Conference on Open Source Systems.
1. Building multi-billion search engines on open source technologies
Andrei Lopatenko, PhD
Vice President of Engineering, Zillow Group
2. Who am I
Core contributor to Google Search (2006-2010), Apple AppStore/iTunes
Co-designed and co-implemented Apple Maps Search (2010), Walmart Grocery
Led search teams at Zillow (now), Walmart, eBay
PhD in Computer Science, The University of Manchester, UK
My path: from core contributor to Google Search to leading the search ecosystems of market leaders in real estate (Zillow, Trulia), eCommerce (Walmart, eBay), and digital distribution (Apple)
Designing, building, implementing, improving, and running multi-billion search engines for the last 15 years
3. My goal for this talk
I want to demonstrate that using open source significantly helps in
● implementation
● continuous improvement of both search infrastructure and search quality
● support and operations
of search engines with billions of users, billions of documents, and (dozens of) billions in revenue/GMV.
A typical search engine uses hundreds of open source products/libraries; I'll focus on some of them that have proven to be useful.
4. How I am going to do it
Show a reference architecture of a typical multi-billion search engine
Show a typical open source based implementation
Show the limitations of open source implementations (more in Q&A)
(My talk is limited to 30 min, so I'll be brief; the topic deserves a full-day tutorial.)
The talk will be encyclopedic in style; let's go deeper in Q&A. I'll focus on software used or tried in the search engines I've built, rather than a comprehensive survey of every open source package available (I may be biased at times), and on what is useful for different types of search applications.
5. Why a multi-billion search engine? Many billions of what?
1. Multi-billion dollars in revenue/GMV
2. Multi-billion users -> billions of queries per day
3. Multi-billion documents
How are these numbers relevant to the software architecture?
6. Multi-billion dollars
Implies
1. Potential gains of hundreds of millions to billions of dollars per year from higher conversion due to better ranking, retrieval, and query understanding functions, latency improvements, and UX
2. Continuous work on search quality, search features, search infrastructure, and ranking improvements to realize billion-dollar gains
3. Complexity of the search stack and many of its features, due to the complexity of the business it supports
7. Multi-billion users
Implies
1. Billions of queries per day
2. Tough throughput requirements
3. Distributed deployments for high load
8. Multi-billion documents
1. Distributed retrieval and ranking systems to run retrieval and ranking over billions of documents
2. Complexity of search functions (one needs good ranking when there are too many matching documents for a query)
3. Frequency of updates (even a small update rate causes frequent updates when there are billions of documents) -> index update latency requirements
9. Are there many search engines based on open source?
Google, Amazon, eBay, and MS/Bing use their own technologies:
1. For all of them, the search engine is the core of the business
2. The complexity of ranking, indexing, etc. is well beyond what open source provides
3. Huge monetary and user gains come from optimizations in index size, frequency of index updates, complexity of ranking/query understanding functions, and integration with other systems
4. Each company has a big search team to improve and develop its search engine
LinkedIn, eBay, and Airbnb moved away from open source due to the limitations of open source search engines.
10. Are there many search engines based on open source?
Walmart eCommerce (Solr, moved to Elastic)
Apple Store, iTunes (Solr)
Adobe Cloud (Elastic)
Uber (Solr)
Salesforce
(See conferences such as Activate and Elastic{ON} for details on how open source is used in search.)
11. Typical Search Engines
Simple search: a small collection of documents, low sensitivity to user satisfaction -> search SaaS (Algolia, Google Custom Search, SwiftType (bought by Elastic), Searchify)
Typical big search: hundreds of millions to billions of users, billions in revenue, billions of documents -> open source based search engines (the focus of this talk)
Very big search: billions of users, 100s of billions of documents, high demand for search quality, huge search ops costs -> custom search engines (Google, Amazon, etc.)
(Axis of the slide's diagram: complexity, size, demand for quality, operation cost)
12. Historical reasons are important
We built a custom Apple Search in 2010 (C++). Solr/Lucene was not good enough at the time for the expected number of users (a billion) and for search quality. Now I would make a different decision.
We built a custom graph-based search engine in 2016 (Ozlo, sold to Facebook). Open source was not good enough for the QPS we needed for the graph language we had to build, natural language search, the required frequency of graph updates, etc. Now open source solutions for graph search engines are good for reasonable requirements.
13. Typical Search Engine - High Level View
Data Acquisition, Indexing, Ranking, Retrieval, Query Understanding, UX, Search Assistance, Logging, Monitoring, Experiment Management, SERP Logic, Other
14. Main Components
Aka search quality: search assistance services (autosuggest, dynamic facets, etc.), query understanding, ranking, SERP logic (snippet building; universal search, i.e. mixing results from different corpora)
Aka search infrastructure: retrieval, logging, monitoring, ops
Aka indexing: data acquisition (crawling the web, feeds, data imports), including data enrichment (duplicate resolution, data cleaning, mapping into common dictionaries, extraction), and indexing (building the index for retrieval systems)
15. Query Understanding
Task: process a query; parse it, extract information from it, classify it, and add information useful for retrieval and ranking
Latency requirement: typical limit 10 ms, up to 50 ms
Throughput requirement: at least the QPS of the system; billions of queries per day -> ~10^5 QPS
Frequently distributed: requires loading large vocabularies (language models, embeddings) of gigabytes in size; naturally split into many components for different classification, parsing, and expansion tasks (up to hundreds); heavy CPU and GPU load
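As an illustration, a minimal sketch of one query understanding component built on spaCy (the model name, output schema, and example query are my assumptions, not from the talk):

import spacy

# Small English pipeline; assumes `python -m spacy download en_core_web_sm` was run
nlp = spacy.load("en_core_web_sm")

def understand(query: str) -> dict:
    """Parse a raw query into annotations usable by retrieval and ranking."""
    doc = nlp(query)
    return {
        "query": query,
        "tokens": [t.lemma_.lower() for t in doc if not t.is_stop],
        "entities": [(ent.text, ent.label_) for ent in doc.ents],  # e.g. locations, orgs
    }

print(understand("2 bedroom apartments in San Francisco under 3000"))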
17. Query Understanding - Open Source
1. spaCy
2. fastText
3. Stanford CoreNLP
4. Apache OpenNLP
5. Google SLING
These are frequently not performant enough; one of my groups had to rewrite the internals of the Stanford parser to make it performant.
18. Query Understanding - Open Source
1. Hugging Face Transformers
2. Zalando Flair
3. Facebook PyText
4. ULMFiT
5. OpenNMT for machine translation
19. Query Understanding
Latency and throughput requirements are tight.
A lot of caching for hard-to-compute models (the query distribution is skewed).
Fast, performant models, such as fastText-based ones, have a big advantage (see the sketch below).
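A minimal sketch of this pattern: a fastText query classifier behind an in-process LRU cache that exploits the skewed query distribution (the model file and label set are hypothetical):

import fasttext
from functools import lru_cache

# Hypothetical supervised model trained offline on (query, category) pairs
model = fasttext.load_model("query_intent.bin")

@lru_cache(maxsize=1_000_000)  # head queries dominate traffic, so the hit rate is high
def classify(query: str):
    labels, probs = model.predict(query.lower(), k=3)  # top-3 labels with probabilities
    return tuple(zip(labels, probs.tolist()))

print(classify("iphone 12 case"))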
20. Query Understanding - Open Source - Tools
Doccano: annotation tool, to create training sets
AllenNLP as an NLP RankLab, to test models
Snorkel MeTaL: great weak supervision tools; weak supervision is used for many practical NLP tasks
21. Search Assistance
Autosuggest / type-ahead: Solr has the Suggester; Elastic has similar features too.
But building a custom autosuggest on top of them is a huge improvement:
1. Better language modeling
2. Context (user, location)
3. Improved retrieval (substring matching, not prefix only; spell correction in autosuggest)
All are big improvements in quality/user satisfaction.
The difference is big: 30%+ engagement for eCommerce.
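For reference, a minimal completion-suggester request against Elasticsearch (the index and field names are assumptions); this is the kind of baseline a custom autosuggest layer then improves on:

import requests

# Assumes an index "products" with a field "suggest" mapped as type "completion"
resp = requests.post(
    "http://localhost:9200/products/_search",
    json={
        "suggest": {
            "q-suggest": {
                "prefix": "ipho",
                "completion": {"field": "suggest", "size": 10},
            }
        }
    },
)
for option in resp.json()["suggest"]["q-suggest"][0]["options"]:
    print(option["text"])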
22. Search Assistance - Filters / Facets
Both Solr and Elastic generate facets,
but usually you have to build your own faceting on top of them to generate facets that are good for users.
The difference can be huge: 15%+ conversion for eCommerce.
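A minimal facet request via an Elasticsearch terms aggregation (again, index and field names are assumptions); a custom faceting layer would reorder and filter these buckets for the user:

import requests

# Facet counts per brand for a query; assumes a "products" index with a keyword field "brand"
resp = requests.post(
    "http://localhost:9200/products/_search",
    json={
        "size": 0,
        "query": {"match": {"title": "running shoes"}},
        "aggs": {"brands": {"terms": {"field": "brand", "size": 20}}},
    },
)
for bucket in resp.json()["aggregations"]["brands"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])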
23. Ranking - what is it?
Given:
1. a query, with all information derived about the query
2. a user, with all information known about the user
3. a set of retrieved results
produce a ranking of the results optimized for the highest success,
where success might be click-through rate, conversion rate, GMV, revenue, or other engagement, satisfaction, and monetization metrics.
Latency requirement: typical limit 10 ms, up to 50 ms
Throughput requirement: at least the QPS of the system; billions of queries per day -> ~10^5 QPS
24. Ranking - technologies - LeToR
LeToR (learning to rank) / MLR (machine-learned ranking) functions have proven to be highly successful for ranking, given:
1. the complexity of the many query/user/document features used for ranking
2. the complexity of ranking functions
3. tuning for various metrics / re-tuning as metrics change
They may require intervention (policies and legal regulation of search engines, multi-metric optimization, internal policies not expressible in learnable functions).
25. Ranking - Open Source - Training
A popular approach (the training step is sketched below):
A proprietary LeToR 'RankLab' based on an LTR model type such as GBDT or SVMrank, using open source such as XGBoost, CatBoost, or SVMlight to learn the ranking function
Converting the model generated by the previous step into source code for your ranking engine (Java, C++)
Compiling and deploying the ranking function as part of the ranking layer
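A minimal sketch of the training step with XGBoost's learning-to-rank objective (the feature matrix, labels, and query grouping are synthetic placeholders):

import numpy as np
import xgboost as xgb

# Synthetic stand-in data: 1000 candidate documents, 50 features,
# graded relevance labels 0-4, grouped into 100 queries of 10 candidates each
X = np.random.rand(1000, 50)
y = np.random.randint(0, 5, size=1000)
dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([10] * 100)  # number of documents per query, in row order

params = {"objective": "rank:ndcg", "eta": 0.1, "max_depth": 6, "eval_metric": "ndcg@10"}
model = xgb.train(params, dtrain, num_boost_round=100)
model.dump_model("ranker.txt")  # text dump of the trees; a starting point for codegen to Java/C++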
26. Ranking - Open Source - Training
1. Solr LTR contrib module, integrated into Solr
2. Google TF-Ranking
3. RankLib
27. Ranking - Feature Storage
Some ranking features are in the index; many are not (due to storage limits and performance costs).
Features must be retrieved during the ranking stage, for every query, for ~1000+ documents during the 2nd-3rd ranking stages, with hundreds of features -> a feature storage (see the sketch below)
Google Feast, CouchDB, Redis, etc.
Wide-column stores, document stores
Optimized for multi-id lookups
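A minimal sketch of a multi-id feature lookup against Redis (the key schema is hypothetical): one pipelined round trip fetches feature hashes for all candidate documents.

import redis

r = redis.Redis(host="localhost", port=6379)

def fetch_features(doc_ids):
    """Fetch per-document feature hashes in a single pipelined round trip."""
    pipe = r.pipeline()
    for doc_id in doc_ids:
        pipe.hgetall(f"feat:{doc_id}")  # hypothetical key schema: feat:<doc_id>
    return dict(zip(doc_ids, pipe.execute()))

features = fetch_features([f"doc{i}" for i in range(1000)])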
28. Post-ranking feature layer
Many search systems require a post-ranking layer to re-arrange results, remove results, or change the ranking based on secondary criteria: the current price (which changes frequently), availability in a store/warehouse (which changes frequently), optimizing cross-category selling, etc.; these are non-persistent features delivered by external systems (warehouse management, sales management).
Key-value stores optimized for read-write: Redis, etc.
29. Caching
High-load systems:
caching on top of the search engine keeps the results of frequent queries, to reduce load on the system and get low latency for cached queries (no need to recompute the result set).
Caches are implemented in Solr and Elastic.
Frequently, a cache is also needed on top of the whole search engine (all ranking layers after L0: personalization, diversification, etc.; the merged output of multiple Solr/Elastic engines, aka universal search, is what gets cached): Redis, Memcached (see the sketch below).
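A minimal sketch of such a result cache in front of the engine, using Redis with a TTL (the key scheme, TTL, and backend stub are assumptions):

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def run_search_backend(query):
    # Placeholder for the real call into query understanding / retrieval / ranking
    return [{"doc_id": "doc1", "score": 1.0}]

def cached_search(query, ttl_s=300):
    key = "serp:" + query.strip().lower()  # hypothetical normalization and key scheme
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)             # cache hit: no recomputation of the result set
    results = run_search_backend(query)
    r.setex(key, ttl_s, json.dumps(results))
    return results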
30. ML and NLP training / Rank Lab / ML tools
Mostly for ranking and query understanding, but used for many other things too: duplicate resolution, anomaly detection
Spark
Platforms gaining maturity and popularity: Kubeflow, MLflow
PyTorch, TensorFlow, and building your ML stack on top of them
XGBoost, CatBoost, etc.
(This slide deserves a separate one-hour talk, as do all the others.)
31. ML and NLP training / Rank Lab / ML tools
R/CRAN, despite its limits, is quite useful for many search ML tasks at the exploration stage, and some teams still use it for production learning of LeToR models.
My experience: R time-series modules to learn time dependencies for the ranking function (for queries sensitive to fast topicality shifts), R stats packages to learn language-affinity features, and other ranking tasks.
Analysis of A/B tests, contextual bandits, and other types of experiments: typically Python code built on top of scikit-learn, pandas, and numpy, as in the example below.
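As a small example of that last point, a two-proportion z-test for an A/B conversion experiment, here using statsmodels (the counts are made up; the talk names the scientific Python stack generally, not this exact library):

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Made-up counts: conversions and sessions for control (A) and treatment (B)
conversions = np.array([10_450, 10_920])
sessions = np.array([500_000, 500_000])

z, p = proportions_ztest(conversions, sessions)
lift = conversions[1] / sessions[1] - conversions[0] / sessions[0]
print(f"absolute lift={lift:.5f}, z={z:.2f}, p-value={p:.4f}")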
32. Retrieval
Given a query Q, return an ordered list of <DocID, Score> pairs.
Lucene ecosystem: Solr and Elastic for text documents, text scoring, multiple fields, filters, sorts, spatial queries; also Yahoo Vespa
Geographic: Solr Spatial
Graphs: Neo4j, Apache Giraph, TitanDB
Vector search: Facebook FAISS, MS SPTAG (see the sketch below)
Google S2 geometry, JTS Topology Suite: libraries to implement spatial indexing/retrieval
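A minimal FAISS sketch for the vector-search case (dimensions and data are synthetic; a production index would use a compressed ANN index type such as IVF-PQ rather than a flat index):

import numpy as np
import faiss

d = 128  # embedding dimension (synthetic)
doc_vectors = np.random.rand(100_000, d).astype("float32")
query_vectors = np.random.rand(5, d).astype("float32")

index = faiss.IndexFlatIP(d)  # exact inner-product search
index.add(doc_vectors)
scores, doc_ids = index.search(query_vectors, 10)  # top-10 <DocID, Score> per query
print(doc_ids[0], scores[0])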
33. Retrieval
But any real search application will require a lot of optimization on top of the existing system for index sizes, QPS loads, etc.,
such as sharding (depending on the search application).
34. Retrieval - open source problems
Both Solr and Elastic are good for limited cluster sizes; there are serious problems scaling to very large collections.
There has been huge progress recently (2010+) in new index compression techniques that produce a smaller index with fast retrieval, worth dozens of millions of dollars in hardware costs for 'big' search.
None of these are available in the Solr/Elastic engines.
Default query parsers are quite limited (in 2015, multi-word synonyms were not supported); building a real search engine requires a lot of work in the internals of Solr/Elastic.
35. Indexing and Data Acquisition
1. Acquire: get data from external sources (the web, feeds, etc.) into internal systems. Some data might come in bulk volumes (petabytes per day: web crawls), some might be frequently updated (hundreds of millions of updates per day: prices, availability), etc.
2. Enrich: merge, resolve duplicates, remove noise, normalize, clean, enrich. Do two pages or two items represent the same object? Are the latitude and longitude consistent with the address? Map full-text descriptions of an item into a set of attributes with values. Add derived signals (probability to sell, demand, similarity to other items) to the item, etc.
3. Index: map the data into a format understandable by the retrieval systems.
36. Acquisition - open source
Crawling the web: Apache Nutch and Heritrix at scale, many others for smaller scale
Apache Atlas: governance, data registry, data discovery
A lot of domain-specific data acquisition tools: ACQ4 for neurophysiology, open source GISs
Transport: Apache Kafka, Apache Pulsar to bring data from external systems (see the sketch below)
Document stores, wide-column stores: Cassandra/ScyllaDB, HBase, and many others to store acquired data before indexing, depending on needs, data types, update frequency
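A minimal sketch of the transport step with kafka-python (the topic name and message format are assumptions): consume update events and hand them to the enrichment/indexing pipeline.

import json
from kafka import KafkaConsumer

# Consume item updates (e.g. price/availability changes) for downstream enrichment
consumer = KafkaConsumer(
    "item-updates",                      # hypothetical topic name
    bootstrap_servers=["localhost:9092"],
    group_id="indexing-pipeline",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    doc = message.value
    print(doc["id"], doc.get("price"))   # hand off to enrichment / index writer here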
37. Enrich
Duplicate resolution, cleaning, adding derived data: there is no good open source for this.
A lot of workflow management; see the orchestration slide.
Spark/Spark Streaming and Flink work at the higher level of running jobs, but there is nothing at the task level (and the tasks are too domain-specific).
38. Indexing
Indexing is specific to search, and search engines provide their own indexing:
Solr and Elastic for text search; Vespa
Facebook FAISS, MS SPTAG for similarity search
Neo4j and other graph systems
39. Acquisition, Enrichment, Indexing - orchestration
Typically, data pipelines consist of hundreds of individual components
integrating external data, internal data, and derived data, and cleaning and enriching them.
Orchestration becomes very important: multiple jobs written in different languages, accessing different data stores.
Airflow, Oozie, even Luigi, depending on complexity and environment; a minimal example follows.
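A minimal Airflow sketch of the acquire -> enrich -> index chain (the DAG id, schedule, and task bodies are placeholders, not from the talk):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def acquire():
    pass  # pull feeds / crawl output into internal storage

def enrich():
    pass  # dedup, clean, add derived signals

def build_index():
    pass  # write the retrieval index

with DAG(
    dag_id="search_index_pipeline",  # hypothetical pipeline name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_acquire = PythonOperator(task_id="acquire", python_callable=acquire)
    t_enrich = PythonOperator(task_id="enrich", python_callable=enrich)
    t_index = PythonOperator(task_id="index", python_callable=build_index)

    t_acquire >> t_enrich >> t_index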
40. Other useful stuff to build search systems
Low level: gRPC, protobuf, Snappy, gflags, glog, Google Benchmark, etc.: found everywhere in a typical search stack
Google S2 geometry or JTS Topology Suite: location-based search (see the sketch below)
GTest, xUnit-style frameworks: testing
Jaeger or Apache SkyWalking: distributed tracing
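For the S2 case, a minimal sketch using the s2sphere Python port (the bounding box is an arbitrary example): cover a lat/lng rectangle with cell ids, whose tokens can be stored in the index as terms for spatial retrieval.

import s2sphere

# Cover a bounding box (roughly San Francisco; arbitrary example) with S2 cells
coverer = s2sphere.RegionCoverer()
coverer.max_cells = 8
rect = s2sphere.LatLngRect.from_point_pair(
    s2sphere.LatLng.from_degrees(37.70, -122.52),
    s2sphere.LatLng.from_degrees(37.82, -122.35),
)
for cell_id in coverer.get_covering(rect):
    print(cell_id.to_token())  # token can be indexed as a term for spatial retrieval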
41. Ops
Kubernetes
Knative: automated running of serverless containers on K8s, with autoscaling and revision tracking
Bloomberg's Solr Operator: to run Solr on a K8s cluster
Netflix's Chaos Monkey: chaos engineering for a distributed system such as search; it kills instances periodically. A multi-billion search engine typically has hundreds of services running on many nodes, and the system must be fault tolerant.
42. Ops
SolrCloud: tools to set up a fault-tolerant, highly available Solr cluster
Elasticsearch Cross-Cluster Replication (CCR): replicating indices across data centers for Elastic
Terraform: defining and provisioning data centers for search ops
Jenkins: deployment
44. Conversational Search - Task-Oriented
Dialog management, NLU specific to conversational tasks, slot extraction, intent classification: Rasa. You can use Rasa to manage dialog state (it is great at that, from learning-based to manual tools for analyzing dialogs), but build your own NLU.
Plenty of research open source projects, outcomes of the DSTC competitions, focus on various problems in building dialog systems: dialog act classification, slot extraction, dialog breakdown detection.
Nvidia NeMo if you need your own automatic speech recognition (tuning for a specific language, domain, business).
45. Conversational Search - QA over Paragraphs
Results of academic research and competitions are frequently available online:
Passage re-ranking with BERT (dl4marco-bert)
TANDA from Alexa
YodaQA
cdQA
DrQA
These require a lot of work to tune for your corpora.
46. NL Search over Structured Data
SQLNet
University of Freiburg's Aqqu
Percy Liang's Dependency-Based Compositional Semantics
Mostly academic code
47. Conclusions
Open source technologies are useful for building many parts of the search engine stack, from low-level code using libraries (e.g. gRPC) to whole open source systems (e.g. Solr/Elastic, Kafka, the TensorFlow runtime).
Each part will require a lot of work to tune it for your environment, write code on top of it, and design its ops.
Do not be surprised if you have to rewrite certain open source components in your implementation as your search engine grows and you need 'extreme' tuning, performance, and quality.
48. Conclusions
A search engine is the joint work of many, many people.
There are many people who know XGBoost, CatBoost, TensorFlow, Solr, Kafka, Kibana:
using open source simplifies hiring for expertise.
Search systems require continuous, never-ending evolution: they need to be modified and rewritten many times (module by module); open source is a big advantage due to access to the source and community help (sometimes).
Very few companies can afford search teams of hundreds or thousands of people as in very big search; even the teams behind 20+ billion dollar search engines are quite small. Open source helps.