How to use open source technologies to build search engines for billions of users, billions of revenue, billions of documents
Keynote talk at The 16th International Conference on Open Source Systems.
1. Building multi-billion search engines on open source technologies
Andrei Lopatenko, PhD
Vice President of Engineering, Zillow Group
2. Who am I
Core contributor to Google Search (2006-2010), Apple AppStore/iTunes
Co-designed and co-implemented Apple Maps Search (2010), Walmart Grocery
Led search teams at Zillow (now), Walmart, eBay
PhD in Computer Science, The University of Manchester, UK
My path: from core contributor to Google Search to leading the search ecosystems of market leaders in real estate (Zillow, Trulia), eCommerce (Walmart, eBay), and digital distribution (Apple)
Designing, building, implementing, improving, and running multi-billion search engines for the last 15 years
3. My goal for this talk
I want to demonstrate that using open source significantly helps in
● implementation
● continuous improvement of both search infrastructure and search quality
● support and operations
of search engines with billions of users, billions of documents, and (dozens of) billions in revenue/GMV.
A typical search engine uses hundreds of open source products/libraries; I'll focus on some of them that have proven to be useful.
4. How I am going to do it
Show a reference architecture of a typical multi-billion search engine
Show a typical open source based implementation
Show the limitations of open source implementations (more in Q&A)
(My talk is limited to 30 min, so I'll be brief; the topic deserves a full-day tutorial.)
The talk will be encyclopedic in style; let's go deeper in Q&A. I'll focus on software used or tried in the search engines I've built, rather than a comprehensive survey of every open source package available (I may be biased at times), and on what is useful for different types of search applications.
5. Why a multi-billion search engine? Many billions of what?
1. Multi-billion dollars in revenue/GMV
2. Multi-billion users -> billions of queries per day
3. Multi-billion documents
How are these numbers relevant to the software architecture?
6. Multi-billion dollars
Implies
1. Potential gains of hundreds of millions to billions of dollars per year from higher conversion due to better ranking, retrieval, and query understanding functions, latency improvements, and UX
2. Continuous work on search quality, search features, search infrastructure, and ranking improvements to realize billion-dollar gains
3. Complexity of the search stack and many of its features, due to the complexity of the business it supports
7. Multi-billion users
Implies
1. Billions of queries per day
2. Tough throughput requirements
3. Distributed deployments for high load
8. Multi-billion documents
1. Distributed retrieval and ranking systems to run retrieval and ranking over billions of documents
2. Complexity of search functions (one needs good ranking when there are too many matching documents for a query)
3. Frequency of updates (even a small update rate causes frequent updates when there are billions of documents) -> index update latency requirements
9. Are there many search engines based on open source?
Google, Amazon, eBay, and MS/Bing use their own technologies:
1. For all of them, the search engine is the core of the business
2. The complexity of ranking, indexing, etc. is well beyond what open source provides
3. Huge monetary and user gains come from optimizations in index size, frequency of index updates, complexity of ranking/query understanding functions, and integration with other systems
4. Each company has a big search team to improve and develop its search engine
LinkedIn, eBay, and Airbnb moved away from open source due to the limitations of open source search engines.
10. Are there many search engines based on open source?
Walmart eCommerce (Solr, moved to Elastic)
Apple Store, iTunes (Solr)
Adobe Cloud (Elastic)
Uber (Solr)
Salesforce
(See conferences such as Activate and Elastic{ON} for details on how open source is used in search.)
11. Typical Search Engines
Simple search: a small collection of documents, low sensitivity to user satisfaction -> search SaaS (Algolia, Google Custom Search, SwiftType (bought by Elastic), Searchify)
Typical big search: hundreds of millions to billions of users, billions in revenue, billions of documents -> open source based search engines (the focus of this talk)
Very big search: billions of users, 100s of billions of documents, high demand for search quality, huge search ops costs -> custom search engines (Google, Amazon, etc.)
(Axis of the slide's diagram: complexity, size, demand for quality, operation cost)
12. Historical reasons are important
We built a custom Apple Search in 2010 (C++). Solr/Lucene was not good enough at the time for the expected number of users (a billion) and for search quality. Now I would make a different decision.
We built a custom graph-based search engine in 2016 (Ozlo, sold to Facebook). Open source was not good enough for the QPS we needed for the graph language we had to build, natural language search, the required frequency of graph updates, etc. Now open source solutions for graph search engines are good for reasonable requirements.
13. Typical Search Engine - High Level View
Data Acquisition, Indexing, Ranking, Retrieval, Query Understanding, UX, Search Assistance, Logging, Monitoring, Experiment Management, SERP Logic, Other
14. Main Components
Aka search quality: search assistance services (autosuggest, dynamic facets, etc.), query understanding, ranking, SERP logic (snippet building; universal search, i.e. mixing results from different corpora)
Aka search infrastructure: retrieval, logging, monitoring, ops
Aka indexing: data acquisition (crawling the web, feeds, data imports), including data enrichment (duplicate resolution, data cleaning, mapping into common dictionaries, extraction), and indexing (building the index for retrieval systems)
15. Query Understanding
Task: process a query; parse it, extract information from it, classify it, and add information useful for retrieval and ranking
Latency requirement: typical limit 10 ms, up to 50 ms
Throughput requirement: at least the QPS of the system; billions of queries per day -> ~10^5 QPS
Frequently distributed: requires loading large vocabularies (language models, embeddings) of gigabytes in size; naturally split into many components for different classification, parsing, and expansion tasks (up to hundreds); heavy CPU and GPU load
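As an illustration, a minimal sketch of one query understanding component built on spaCy (the model name, output schema, and example query are my assumptions, not from the talk):

import spacy

# Small English pipeline; assumes `python -m spacy download en_core_web_sm` was run
nlp = spacy.load("en_core_web_sm")

def understand(query: str) -> dict:
    """Parse a raw query into annotations usable by retrieval and ranking."""
    doc = nlp(query)
    return {
        "query": query,
        "tokens": [t.lemma_.lower() for t in doc if not t.is_stop],
        "entities": [(ent.text, ent.label_) for ent in doc.ents],  # e.g. locations, orgs
    }

print(understand("2 bedroom apartments in San Francisco under 3000"))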
17. Query Understanding - Open Source
1. spaCy
2. fastText
3. Stanford CoreNLP
4. Apache OpenNLP
5. Google SLING
These are frequently not performant enough; one of my groups had to rewrite the internals of the Stanford parser to make it performant.
18. Query Understanding - Open Source
1. Hugging Face Transformers
2. Zalando Flair
3. Facebook PyText
4. ULMFiT
5. OpenNMT for machine translation
19. Query Understanding
Latency and throughput requirements are tight.
A lot of caching for hard-to-compute models (the query distribution is skewed).
Fast, performant models, such as fastText-based ones, have a big advantage (see the sketch below).
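A minimal sketch of this pattern: a fastText query classifier behind an in-process LRU cache that exploits the skewed query distribution (the model file and label set are hypothetical):

import fasttext
from functools import lru_cache

# Hypothetical supervised model trained offline on (query, category) pairs
model = fasttext.load_model("query_intent.bin")

@lru_cache(maxsize=1_000_000)  # head queries dominate traffic, so the hit rate is high
def classify(query: str):
    labels, probs = model.predict(query.lower(), k=3)  # top-3 labels with probabilities
    return tuple(zip(labels, probs.tolist()))

print(classify("iphone 12 case"))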
20. Query Understanding - Open Source - Tools
Doccano: annotation tool, to create training sets
AllenNLP as an NLP RankLab, to test models
Snorkel MeTaL: great weak supervision tools; weak supervision is used for many practical NLP tasks
21. Search Assistance
Autosuggest / type-ahead: Solr has the Suggester; Elastic has similar features too.
But building a custom autosuggest on top of them is a huge improvement:
1. Better language modeling
2. Context (user, location)
3. Improved retrieval (substring matching, not prefix only; spell correction in autosuggest)
All are big improvements in quality/user satisfaction.
The difference is big: 30%+ engagement for eCommerce.
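For reference, a minimal completion-suggester request against Elasticsearch (the index and field names are assumptions); this is the kind of baseline a custom autosuggest layer then improves on:

import requests

# Assumes an index "products" with a field "suggest" mapped as type "completion"
resp = requests.post(
    "http://localhost:9200/products/_search",
    json={
        "suggest": {
            "q-suggest": {
                "prefix": "ipho",
                "completion": {"field": "suggest", "size": 10},
            }
        }
    },
)
for option in resp.json()["suggest"]["q-suggest"][0]["options"]:
    print(option["text"])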
22. Search Assistance - Filters / Facets
Both Solr and Elastic generate facets,
but usually you have to build your own faceting on top of them to generate facets that are good for users.
The difference can be huge: 15%+ conversion for eCommerce.
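A minimal facet request via an Elasticsearch terms aggregation (again, index and field names are assumptions); a custom faceting layer would reorder and filter these buckets for the user:

import requests

# Facet counts per brand for a query; assumes a "products" index with a keyword field "brand"
resp = requests.post(
    "http://localhost:9200/products/_search",
    json={
        "size": 0,
        "query": {"match": {"title": "running shoes"}},
        "aggs": {"brands": {"terms": {"field": "brand", "size": 20}}},
    },
)
for bucket in resp.json()["aggregations"]["brands"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])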
23. Ranking - what is it?
Given:
1. a query, with all information derived about the query
2. a user, with all information known about the user
3. a set of retrieved results
produce a ranking of the results optimized for the highest success,
where success might be click-through rate, conversion rate, GMV, revenue, or other engagement, satisfaction, and monetization metrics.
Latency requirement: typical limit 10 ms, up to 50 ms
Throughput requirement: at least the QPS of the system; billions of queries per day -> ~10^5 QPS
24. Ranking - technologies - LeToR
LeToR (learning to rank) / MLR (machine-learned ranking) functions have proven to be highly successful for ranking, given:
1. the complexity of the many query/user/document features used for ranking
2. the complexity of ranking functions
3. tuning for various metrics / re-tuning as metrics change
They may require intervention (policies and legal regulation of search engines, multi-metric optimization, internal policies not expressible in learnable functions).
25. Ranking - Open Source - Training
A popular approach (the training step is sketched below):
A proprietary LeToR 'RankLab' based on an LTR model type such as GBDT or SVMrank, using open source such as XGBoost, CatBoost, or SVMlight to learn the ranking function
Converting the model generated by the previous step into source code for your ranking engine (Java, C++)
Compiling and deploying the ranking function as part of the ranking layer
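A minimal sketch of the training step with XGBoost's learning-to-rank objective (the feature matrix, labels, and query grouping are synthetic placeholders):

import numpy as np
import xgboost as xgb

# Synthetic stand-in data: 1000 candidate documents, 50 features,
# graded relevance labels 0-4, grouped into 100 queries of 10 candidates each
X = np.random.rand(1000, 50)
y = np.random.randint(0, 5, size=1000)
dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([10] * 100)  # number of documents per query, in row order

params = {"objective": "rank:ndcg", "eta": 0.1, "max_depth": 6, "eval_metric": "ndcg@10"}
model = xgb.train(params, dtrain, num_boost_round=100)
model.dump_model("ranker.txt")  # text dump of the trees; a starting point for codegen to Java/C++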
26. Ranking - Open Source - Training
1. Solr LTR contrib module, integrated into Solr
2. Google TF-Ranking
3. RankLib
27. Ranking - Feature Storage
Some ranking features are in the index; many are not (due to storage limits and performance costs).
Features must be retrieved during the ranking stage, for every query, for ~1000+ documents during the 2nd-3rd ranking stages, with hundreds of features -> a feature storage (see the sketch below)
Google Feast, CouchDB, Redis, etc.
Wide-column stores, document stores
Optimized for multi-id lookups
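A minimal sketch of a multi-id feature lookup against Redis (the key schema is hypothetical): one pipelined round trip fetches feature hashes for all candidate documents.

import redis

r = redis.Redis(host="localhost", port=6379)

def fetch_features(doc_ids):
    """Fetch per-document feature hashes in a single pipelined round trip."""
    pipe = r.pipeline()
    for doc_id in doc_ids:
        pipe.hgetall(f"feat:{doc_id}")  # hypothetical key schema: feat:<doc_id>
    return dict(zip(doc_ids, pipe.execute()))

features = fetch_features([f"doc{i}" for i in range(1000)])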
28. Post-ranking feature layer
Many search systems require a post-ranking layer to re-arrange results, remove results, or change the ranking based on secondary criteria: the current price (which changes frequently), availability in a store/warehouse (which changes frequently), optimizing cross-category selling, etc.; these are non-persistent features delivered by external systems (warehouse management, sales management).
Key-value stores optimized for read-write: Redis, etc.
29. Caching
High-load systems:
caching on top of the search engine keeps the results of frequent queries, to reduce load on the system and get low latency for cached queries (no need to recompute the result set).
Caches are implemented in Solr and Elastic.
Frequently, a cache is also needed on top of the whole search engine (all ranking layers after L0: personalization, diversification, etc.; the merged output of multiple Solr/Elastic engines, aka universal search, is what gets cached): Redis, Memcached (see the sketch below).
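A minimal sketch of such a result cache in front of the engine, using Redis with a TTL (the key scheme, TTL, and backend stub are assumptions):

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def run_search_backend(query):
    # Placeholder for the real call into query understanding / retrieval / ranking
    return [{"doc_id": "doc1", "score": 1.0}]

def cached_search(query, ttl_s=300):
    key = "serp:" + query.strip().lower()  # hypothetical normalization and key scheme
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)             # cache hit: no recomputation of the result set
    results = run_search_backend(query)
    r.setex(key, ttl_s, json.dumps(results))
    return results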
30. ML and NLP training / Rank Lab / ML tools
Mostly for ranking and query understanding, but used for many other things too: duplicate resolution, anomaly detection
Spark
Platforms gaining maturity and popularity: Kubeflow, MLflow
PyTorch, TensorFlow, and building your ML stack on top of them
XGBoost, CatBoost, etc.
(This slide deserves a separate one-hour talk, as do all the others.)
31. ML and NLP training / Rank Lab / ML tools
R/CRAN, despite its limits, is quite useful for many search ML tasks at the exploration stage, and some teams still use it for production learning of LeToR models.
My experience: R time-series modules to learn time dependencies for the ranking function (for queries sensitive to fast topicality shifts), R stats packages to learn language-affinity features, and other ranking tasks.
Analysis of A/B tests, contextual bandits, and other types of experiments: typically Python code built on top of scikit-learn, pandas, and numpy, as in the example below.
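As a small example of that last point, a two-proportion z-test for an A/B conversion experiment, here using statsmodels (the counts are made up; the talk names the scientific Python stack generally, not this exact library):

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Made-up counts: conversions and sessions for control (A) and treatment (B)
conversions = np.array([10_450, 10_920])
sessions = np.array([500_000, 500_000])

z, p = proportions_ztest(conversions, sessions)
lift = conversions[1] / sessions[1] - conversions[0] / sessions[0]
print(f"absolute lift={lift:.5f}, z={z:.2f}, p-value={p:.4f}")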
32. Retrieval
Given a query Q, return an ordered list of <DocID, Score> pairs.
Lucene ecosystem: Solr and Elastic for text documents, text scoring, multiple fields, filters, sorts, spatial queries; also Yahoo Vespa
Geographic: Solr Spatial
Graphs: Neo4j, Apache Giraph, TitanDB
Vector search: Facebook FAISS, MS SPTAG (see the sketch below)
Google S2 geometry, JTS Topology Suite: libraries to implement spatial indexing/retrieval
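A minimal FAISS sketch for the vector-search case (dimensions and data are synthetic; a production index would use a compressed ANN index type such as IVF-PQ rather than a flat index):

import numpy as np
import faiss

d = 128  # embedding dimension (synthetic)
doc_vectors = np.random.rand(100_000, d).astype("float32")
query_vectors = np.random.rand(5, d).astype("float32")

index = faiss.IndexFlatIP(d)  # exact inner-product search
index.add(doc_vectors)
scores, doc_ids = index.search(query_vectors, 10)  # top-10 <DocID, Score> per query
print(doc_ids[0], scores[0])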
33. Retrieval
But any real search application will require a lot of optimization on top of the existing system for index sizes, QPS loads, etc.,
such as sharding (depending on the search application).
34. Retrieval - open source problems
Both Solr and Elastic are good for limited cluster sizes; there are serious problems scaling to very large collections.
There has been huge progress recently (2010+) in new index compression techniques that produce a smaller index with fast retrieval, worth dozens of millions of dollars in hardware costs for 'big' search.
None of these are available in the Solr/Elastic engines.
Default query parsers are quite limited (in 2015, multi-word synonyms were not supported); building a real search engine requires a lot of work in the internals of Solr/Elastic.
35. Indexing and Data Acquisition
1. Acquire: get data from external sources (the web, feeds, etc.) into internal systems. Some data might come in bulk volumes (petabytes per day: web crawls), some might be frequently updated (hundreds of millions of updates per day: prices, availability), etc.
2. Enrich: merge, resolve duplicates, remove noise, normalize, clean, enrich. Do two pages or two items represent the same object? Are the latitude and longitude consistent with the address? Map full-text descriptions of an item into a set of attributes with values. Add derived signals (probability to sell, demand, similarity to other items) to the item, etc.
3. Index: map the data into a format understandable by the retrieval systems.
36. Acquisition - open source
Crawling the web: Apache Nutch and Heritrix at scale, many others for smaller scale
Apache Atlas: governance, data registry, data discovery
A lot of domain-specific data acquisition tools: ACQ4 for neurophysiology, open source GISs
Transport: Apache Kafka, Apache Pulsar to bring data from external systems (see the sketch below)
Document stores, wide-column stores: Cassandra/ScyllaDB, HBase, and many others to store acquired data before indexing, depending on needs, data types, update frequency
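A minimal sketch of the transport step with kafka-python (the topic name and message format are assumptions): consume update events and hand them to the enrichment/indexing pipeline.

import json
from kafka import KafkaConsumer

# Consume item updates (e.g. price/availability changes) for downstream enrichment
consumer = KafkaConsumer(
    "item-updates",                      # hypothetical topic name
    bootstrap_servers=["localhost:9092"],
    group_id="indexing-pipeline",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    doc = message.value
    print(doc["id"], doc.get("price"))   # hand off to enrichment / index writer here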
37. Enrich
Duplicate resolution, cleaning, adding derived data: there is no good open source for this.
A lot of workflow management; see the orchestration slide.
Spark/Spark Streaming and Flink work at the higher level of running jobs, but there is nothing at the task level (and the tasks are too domain-specific).
38. Indexing
Indexing is specific to search, and search engines provide their own indexing:
Solr and Elastic for text search; Vespa
Facebook FAISS, MS SPTAG for similarity search
Neo4j and other graph systems
39. Acquisition, Enrichment, Indexing - orchestration
Typically, data pipelines consist of hundreds of individual components
integrating external data, internal data, and derived data, and cleaning and enriching them.
Orchestration becomes very important: multiple jobs written in different languages, accessing different data stores.
Airflow, Oozie, even Luigi, depending on complexity and environment; a minimal example follows.
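A minimal Airflow sketch of the acquire -> enrich -> index chain (the DAG id, schedule, and task bodies are placeholders, not from the talk):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def acquire():
    pass  # pull feeds / crawl output into internal storage

def enrich():
    pass  # dedup, clean, add derived signals

def build_index():
    pass  # write the retrieval index

with DAG(
    dag_id="search_index_pipeline",  # hypothetical pipeline name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_acquire = PythonOperator(task_id="acquire", python_callable=acquire)
    t_enrich = PythonOperator(task_id="enrich", python_callable=enrich)
    t_index = PythonOperator(task_id="index", python_callable=build_index)

    t_acquire >> t_enrich >> t_index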
40. Other useful stuff to build search systems
Low level: gRPC, protobuf, Snappy, gflags, glog, Google Benchmark, etc.: found everywhere in a typical search stack
Google S2 geometry or JTS Topology Suite: location-based search (see the sketch below)
GTest, xUnit-style frameworks: testing
Jaeger or Apache SkyWalking: distributed tracing
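For the S2 case, a minimal sketch using the s2sphere Python port (the bounding box is an arbitrary example): cover a lat/lng rectangle with cell ids, whose tokens can be stored in the index as terms for spatial retrieval.

import s2sphere

# Cover a bounding box (roughly San Francisco; arbitrary example) with S2 cells
coverer = s2sphere.RegionCoverer()
coverer.max_cells = 8
rect = s2sphere.LatLngRect.from_point_pair(
    s2sphere.LatLng.from_degrees(37.70, -122.52),
    s2sphere.LatLng.from_degrees(37.82, -122.35),
)
for cell_id in coverer.get_covering(rect):
    print(cell_id.to_token())  # token can be indexed as a term for spatial retrieval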
41. Ops
Kubernetes
Knative: automated running of serverless containers on K8s, with autoscaling and revision tracking
Bloomberg's Solr Operator: to run Solr on a K8s cluster
Netflix's Chaos Monkey: chaos engineering for a distributed system such as search; it kills instances periodically. A multi-billion search engine typically has hundreds of services running on many nodes, and the system must be fault tolerant.
42. Ops
SolrCloud: tools to set up a fault-tolerant, highly available Solr cluster
Elasticsearch Cross-Cluster Replication (CCR): replicating indices across data centers for Elastic
Terraform: defining and provisioning data centers for search ops
Jenkins: deployment
44. Conversational Search - Task-Oriented
Dialog management, NLU specific to conversational tasks, slot extraction, intent classification: Rasa. You can use Rasa to manage dialog state (it is great at that, from learning-based to manual tools for analyzing dialogs), but build your own NLU.
Plenty of research open source projects, outcomes of the DSTC competitions, focus on various problems in building dialog systems: dialog act classification, slot extraction, dialog breakdown detection.
Nvidia NeMo if you need your own automatic speech recognition (tuning for a specific language, domain, business).
45. Conversational Search - QA over Paragraphs
Results of academic research and competitions are frequently available online:
Passage re-ranking with BERT (dl4marco-bert)
TANDA from Alexa
YodaQA
cdQA
DrQA
These require a lot of work to tune for your corpora.
46. NL Search over Structured Data
SQLNet
University of Freiburg's Aqqu
Percy Liang's Dependency-Based Compositional Semantics
Mostly academic code
47. Conclusions
Open source technologies are useful for building many parts of the search engine stack, from low-level code using libraries (e.g. gRPC) to whole open source systems (e.g. Solr/Elastic, Kafka, the TensorFlow runtime).
Each part will require a lot of work to tune it for your environment, write code on top of it, and design its ops.
Do not be surprised if you have to rewrite certain open source components in your implementation as your search engine grows and you need 'extreme' tuning, performance, and quality.
48. Conclusions
A search engine is the joint work of many, many people.
There are many people who know XGBoost, CatBoost, TensorFlow, Solr, Kafka, Kibana:
using open source simplifies hiring for expertise.
Search systems require continuous, never-ending evolution: they need to be modified and rewritten many times (module by module); open source is a big advantage due to access to the source and community help (sometimes).
Very few companies can afford search teams of hundreds or thousands of people as in very big search; even the teams behind 20+ billion dollar search engines are quite small. Open source helps.