Building multibillion search engines
on open source technologies
Andrei Lopatenko, PhD
Vice President of Engineering, Zillow Group
Who am I
Core contributor to Google Search (2006-2010) and Apple AppStore/iTunes search
Co-designed and co-implemented Apple Maps Search (2010) and Walmart Grocery search
Has led search teams at Zillow (now), Walmart, and eBay
PhD in Computer Science, The University of Manchester, UK
My path: from core contributor to Google Search to leading the search ecosystems of
market leaders in real estate (Zillow, Trulia), eCommerce (Walmart, eBay), and digital
distribution (Apple)
Designing, building, implementing, improving, and running multi-billion search engines
for the last 15 years
My goal for this talk
I want to demonstrate that using open source significantly helps in
● implementation
● continuous improvement, both in search infrastructure and in search quality
● support and operations
of search engines with billions of users, billions of documents, and billions (to tens of
billions) of dollars in revenue/GMV
A typical search engine uses hundreds of open source products/libraries; I'll focus on
some of them that have proven useful
How I am going to do it
Show a reference architecture of a typical multibillion search engine
Show a typical open source based implementation
Show the limitations of open source implementations (more in Q&A)
(My talk is limited to 30 min, so I'll be brief; the topic deserves a day-long tutorial)
The talk will be encyclopedic in style; let's go deeper in Q&A. I'll focus on software
used or tried in the search engines I've built, rather than a comprehensive survey of
every open source package available (I might be biased sometimes), and on what is
useful for different types of search applications
Why multibillion search engine? Many billions of
what?
1. Multiple billions of dollars in revenue/GMV
2. Billions of users -> billions of queries per day
3. Billions of documents
How are these numbers relevant to the software architecture?
Multi billion dollars
Implies
1. Potential gains of hundreds of millions to billions of dollars per year from
higher conversion, due to better ranking, retrieval and query understanding
functions, latency improvements, and UX
2. Continuous work on search quality, search features, search infrastructure, and
ranking improvements to capture billion-dollar gains
3. Complexity of the search stack and many of its features, due to the complexity
of the business it supports
Multi billion users
Implies
1. Billions of queries per day
2. Tough throughput requirements
3. Distributed deployments for high load
Multi billion documents
1. Distributed retrieval and ranking systems to run retrieval and ranking over
billions of documents
2. Complexity of search functions (one needs good ranking if there are too many
documents for a query)
3. Frequency of updates (even a small update rate causes frequent updates if there
are billions of documents) -> tight index update latency requirements
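A back-of-the-envelope sketch of point 3, with illustrative (assumed) numbers:

```python
# Even a modest daily change rate over billions of documents forces a
# high-throughput indexing pipeline. All numbers below are illustrative
# assumptions, not figures from any specific system.
DOCS = 2_000_000_000        # assume 2B documents in the index
DAILY_CHANGE_RATE = 0.01    # assume 1% of documents change per day
SECONDS_PER_DAY = 86_400

updates_per_day = DOCS * DAILY_CHANGE_RATE
updates_per_second = updates_per_day / SECONDS_PER_DAY
print(int(updates_per_day), round(updates_per_second))  # 20000000 231
```

A "small" 1% daily change rate already means a sustained stream of hundreds of index updates per second.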
Are there many search engines based on open
source
Google, Amazon, eBay, and MS/Bing use their own technologies
1. For all of them, the search engine is the core of the business
2. The complexity of ranking, indexing, etc. is well beyond what open source provides
3. Huge monetary and user gains come from optimizations in the size of the index,
the frequency of index updates, the complexity of ranking/query understanding
functions, and integration with other systems
4. Each company has a big search team to improve and develop its search engine
LinkedIn, eBay, and Airbnb moved away from open source due to the limitations of
open source search engines
Are there many search engines based on open
source
Walmart eCommerce (Solr, moved to Elastic)
Apple Store, iTunes (Solr)
Adobe Cloud (Elastic)
Uber (Solr)
Salesforce
(See conferences such as Activate and Elastic{ON} for details on how open source is
used in search)
Typical Search Engines
Simple search: a small collection of documents, low sensitivity to user
satisfaction -> search SaaS (Algolia, Google Custom Search, SwiftType (bought by
Elastic), Searchify)
Typical big search: hundreds of millions to billions of users, billions in revenue,
billions of documents -> open source based search engines <- the focus of this talk
Very big search: billions of users, 100s of billions of documents, high demand for
search quality, huge search ops costs -> custom search engines (Google, Amazon, etc.)
(Axis: complexity, size, demand for quality, operation cost)
Historical reasons are important
We built a custom Apple Search in 2010 (C++). Solr/Lucene was not good enough at
the time for the expected number of users (a billion) and the required search quality.
Today I would make a different decision
We built a custom graph based search engine in 2016 (Ozlo, sold to Facebook). Open
source was not good enough for the QPS we needed, the graph query language we
needed to build natural language search, the required frequency of graph updates, etc.
Now open source solutions for graph search engines are good for reasonable requirements
Typical Search Engine - High Level View
Data
Acquisition
Indexing
Ranking
Retrieval
Query Understanding
UX
Search
Assistance
Logging
Monitoring
Experiment
management
SERP logic
Other
Key Components
Aka Search Quality: Search assistance services (Autosuggest, dynamic facets etc),
Query Understanding, Ranking, SERP logic (snippet building, universal search -
mixing results of different corpora)
Aka Search Infrastructure: retrieval, logging, monitoring, ops
Aka Indexing: data acquisition (crawling web, feeds, data imports), includes data
enrichment (duplicate resolution, cleaning data, mapping into common dictionaries,
extraction), indexing (building index for retrieval systems)
Query Understanding
Task: process a query - parse it, extract information from it, classify it, add
information useful for retrieval and ranking
Latency requirement: typical limit 10 ms, up to 50 ms
Throughput requirement: at least the QPS of the system; billions of queries per day -> ~10^5 QPS
Frequently distributed: requires loading large vocabularies (language models,
embeddings) of gigabytes in size; naturally many components for different
classification, parsing, and expansion tasks (up to hundreds); heavy CPU and GPU load
Query Understanding: Example
Query: 52 inches tv samsung alexa
Query Understanding:
Category: TV; size: 52 inch; brand: Samsung; additional: alexa
Relaxations: size: 48-60 inch; additional: alexa <- optional; brand: Sharp, Sony
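A toy sketch of this kind of parse; the vocabularies and rules below are invented stand-ins for the trained classifiers and taggers a real query understanding service would use:

```python
import re

# Tiny hand-built vocabularies standing in for real category/brand models.
# A production system would use trained classifiers, not lookup tables.
BRANDS = {"samsung", "sony", "sharp", "lg"}
CATEGORIES = {"tv": "TV", "tvs": "TV", "television": "TV"}

def understand(query: str) -> dict:
    tokens = query.lower().split()
    parsed = {"category": None, "brand": None, "size": None, "additional": []}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # "52 inches" / "52 inch" -> a size attribute
        if re.fullmatch(r"\d+", tok) and i + 1 < len(tokens) and tokens[i + 1] in {"inch", "inches"}:
            parsed["size"] = f"{tok} inch"
            i += 2
            continue
        if tok in CATEGORIES:
            parsed["category"] = CATEGORIES[tok]
        elif tok in BRANDS:
            parsed["brand"] = tok
        else:
            parsed["additional"].append(tok)
        i += 1
    return parsed

print(understand("52 inches tv samsung alexa"))
# {'category': 'TV', 'brand': 'samsung', 'size': '52 inch', 'additional': ['alexa']}
```

The structured output is what retrieval and ranking consume: filters (category, size), boosts (brand), and optional terms (alexa).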
Query Understanding - Open Source
1. spaCy
2. fastText
3. Stanford CoreNLP
4. Apache OpenNLP
5. Google SLING
These are frequently not performant enough; one of my teams had to rewrite the
internals of the Stanford parser to make it fast enough
Query Understanding - Open Source
1. Hugging Face Transformers
2. Zalando Flair
3. Facebook PyText
4. ULMFiT
5. OpenNMT for machine translation
Query Understanding
Latency and throughput requirements are tight
A lot of caching for hard-to-compute models (the query distribution is skewed)
Fast, performant models such as fastText-based ones have a big advantage
Query Understanding - Open Source - Tools
Doccano - an annotation tool, to create training sets
AllenNLP as an NLP RankLab, to test models
Snorkel MeTaL - great weak supervision tools; weak supervision is used for many
practical NLP tasks
Search Assistance
Autosuggest / type-ahead: Solr has a Suggester, Elastic has similar features too
But building a custom autosuggest on top of them is a huge improvement:
1. Better language modeling
2. Context (user, location)
3. Improved retrieval (substring, not prefix only; spell correction in autosuggest)
All are big improvements in quality/user satisfaction
The difference is big: 30%+ engagement for eCommerce
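A minimal sketch of improvement 3, substring (not prefix-only) matching over a frequency-weighted suggestion list; the suggestions and weights are made up:

```python
# Suggestions weighted by (hypothetical) query-log frequency. Substring match
# recalls "samsung tv" for the input "tv", which pure prefix match misses.
SUGGESTIONS = {
    "tv stand": 120,
    "samsung tv": 300,
    "tv 52 inch": 80,
    "toaster": 500,
}

def prefix_suggest(text, k=3):
    hits = [s for s in SUGGESTIONS if s.startswith(text)]
    return sorted(hits, key=lambda s: -SUGGESTIONS[s])[:k]

def substring_suggest(text, k=3):
    hits = [s for s in SUGGESTIONS if text in s]
    return sorted(hits, key=lambda s: -SUGGESTIONS[s])[:k]

print(prefix_suggest("tv"))     # ['tv stand', 'tv 52 inch']
print(substring_suggest("tv"))  # ['samsung tv', 'tv stand', 'tv 52 inch']
```

Ranking hits by log frequency is the "better language modeling" part in its simplest form; a real system would also condition on user and location context.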
Search Assistance Filter - Facets
Both Solr and Elastic generate facets
But usually you need to build your own faceting on top of them to generate facets
that are good for users
The difference can be huge: 15%+ conversion for eCommerce
Ranking - what is it?
Given:
1. a query, with all information derived about the query
2. a user, with all information known about the user
3. a set of retrieved results
Produce a ranking of the results optimized for the highest success
Where success might be click-through rate, conversion rate, GMV, revenue, or other
engagement, satisfaction and monetization metrics
Latency requirement: typical limit 10 ms, up to 50 ms
Throughput requirement: at least the QPS of the system; billions of queries per day -> ~10^5 QPS
Ranking - technologies - LeToR
LeToR (learning to rank) / MLR (machine learned ranking) functions are proven to
be highly successful for ranking:
1. complexity of the many query/user/document features to be used for ranking
2. complexity of ranking functions
3. tuning for various metrics / re-tuning as metrics change
May require manual intervention (policies and legal regulations on search engines,
multi-metric optimization, internal policies not expressible in learnable functions)
Ranking - Open Source - Training
A popular approach:
Proprietary LeToR 'RankLab' based on an LTR model type such as GBDT or SVMrank,
using open source such as XGBoost, CatBoost, or SVMlight to learn the ranking function
Converting the model generated by the previous step into source code for your
ranking engine (Java, C++)
Compiling and deploying the ranking function as part of the ranking layer
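The model-to-source-code step can be illustrated with a toy code generator for a single decision tree (real RankLabs emit C++/Java from GBDT ensembles; this tree and its feature names are invented):

```python
# Toy code generator: turn a (hypothetical) learned decision tree into source
# code, mimicking RankLabs that compile model dumps into scorer code.
TREE = {
    "feature": "bm25", "threshold": 2.0,
    "left": {"leaf": 0.1},
    "right": {
        "feature": "clicks", "threshold": 10.0,
        "left": {"leaf": 0.5}, "right": {"leaf": 0.9},
    },
}

def emit(node, indent="    "):
    if "leaf" in node:
        return f"{indent}return {node['leaf']}\n"
    code = f"{indent}if features['{node['feature']}'] < {node['threshold']}:\n"
    code += emit(node["left"], indent + "    ")
    code += f"{indent}else:\n"
    code += emit(node["right"], indent + "    ")
    return code

source = "def score(features):\n" + emit(TREE)
namespace = {}
exec(source, namespace)          # stand-in for "compile and deploy"
score = namespace["score"]
print(score({"bm25": 3.1, "clicks": 25}))  # 0.9
```

Generated code avoids interpreting the model structure at query time, which matters when the scorer runs at ~10^5 QPS under a 10 ms budget.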
Ranking - Open Source - Training
1. Solr LTR contrib module - integrated in Solr
2. Google TF Ranking
3. RankLib
Ranking - Feature Storage
Some ranking features are in the index; many are not (due to storage limits and
performance costs)
Features must be retrieved during the ranking stage, for every query, for ~1000+
documents during the second and third stages of ranking, with hundreds of features ->
a feature storage system
Google Feast, CouchDB, Redis, etc.
Wide column stores, document stores
Optimized for multi-id lookups
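The access pattern is a batched lookup of feature vectors for the candidate documents; below is a minimal in-memory stand-in for a Redis MGET-style multi-id lookup (doc IDs and feature names are invented):

```python
# In-memory stand-in for a feature store (Redis MGET / wide-column multi-get).
# In production this is one batched round trip per query over ~1000 doc IDs.
FEATURE_STORE = {
    "doc:1": {"ctr": 0.12, "price": 499.0},
    "doc:2": {"ctr": 0.07, "price": 899.0},
}

def multi_get(doc_ids, default=None):
    # Missing documents come back as `default`, so ranking can fall back
    # to default feature values instead of failing the query.
    return [FEATURE_STORE.get(f"doc:{d}", default) for d in doc_ids]

print(multi_get([1, 2, 3]))
# [{'ctr': 0.12, 'price': 499.0}, {'ctr': 0.07, 'price': 899.0}, None]
```

The stores listed above are chosen precisely because they serve this kind of batched point lookup with low latency.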
Post-ranking feature layer
Many search systems require a post-ranking layer to re-arrange or remove results, or
to change the ranking based on secondary criteria: the current price (which changes
frequently), availability in a store/warehouse (which changes frequently), optimizing
cross-category selling, etc. - non-persistent features delivered by external systems
(warehouse management, sales management)
Key-value stores optimized for read-write: Redis, etc.
Caching
High load systems:
Caching on top of the search stack keeps the results of frequent queries, to reduce the
load on the system and give low latency for cached queries (no need to recompute the
result set). Caches are implemented in Solr and Elastic
Frequently, a cache is needed on top of the search engine (all layers of ranking
after L0 - personalization, diversification, etc.; the merged outputs of multiple
Solr/Elastic engines to be cached, aka universal search) - Redis, Memcached
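A minimal sketch of such a cache: an LRU keyed by a normalized query, standing in for Redis/Memcached in front of the post-L0 ranking layers (capacity and normalization are illustrative):

```python
from collections import OrderedDict

# Small LRU cache keyed by a normalized query string. A real deployment
# would use Redis/Memcached with a TTL; this shows the access pattern.
class SerpCache:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.data = OrderedDict()

    @staticmethod
    def key(query):
        # Normalize case and whitespace so trivial variants hit the cache.
        return " ".join(query.lower().split())

    def get(self, query):
        k = self.key(query)
        if k in self.data:
            self.data.move_to_end(k)   # mark as recently used
            return self.data[k]
        return None

    def put(self, query, results):
        k = self.key(query)
        self.data[k] = results
        self.data.move_to_end(k)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = SerpCache(capacity=2)
cache.put("Samsung TV", ["doc1", "doc2"])
print(cache.get("samsung  tv"))  # ['doc1', 'doc2'] -- normalization hit
```

Because the query distribution is heavily skewed, even a small cache over normalized queries absorbs a large share of traffic.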
ML and NLP training / Rank Lab / ML tools
Mostly for ranking and query understanding, but used for many other things too:
duplicate resolution, anomaly detection
Spark
Platforms gaining maturity and popularity: Kubeflow, MLflow
PyTorch, TensorFlow - and building your ML stack on top of them
XGBoost, CatBoost, etc.
(this slide deserves a separate 1-hour talk, as do all the others)
ML and NLP training / Rank Lab / ML tools
R/CRAN, despite its limits, is quite useful for many search ML tasks at the exploration
stage, and some teams still use it for production learning of LeToR models
My experience: R time-series modules to learn time dependencies for the ranking
function (for queries sensitive to fast topicality shifts), R stats packages to learn
language affinity features, and other ranking tasks
Analysis of A/B tests, contextual bandits and other types of experiments: typically
Python code built on top of scikit-learn, pandas, numpy
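A typical piece of such analysis code, a two-proportion z-test on conversion rates, using only the standard library (the experiment counts are made up):

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Z statistic for H0: the conversion rates of arms A and B are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Made-up experiment: the treatment ranker converts 5.5% vs 5.0% control.
z = two_proportion_ztest(5000, 100_000, 5500, 100_000)
print(round(z, 2))  # ~5.01, far beyond the usual 1.96 two-sided threshold
```

At the traffic volumes of a multibillion search engine, even sub-percent conversion lifts are detectable and worth shipping, which is why experimentation tooling is a core part of the stack.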
Retrieval
Given a query Q, return an ordered list of <DocID, Score> pairs
Lucene ecosystem (Solr and Elastic) for text documents: text scoring, multiple fields,
filters, sorts, spatial queries; Yahoo Vespa
Geographic: Solr Spatial
Graphs: Neo4j, Apache Giraph, TitanDB
Vector search: Facebook FAISS, MS SPTAG
Google S2 Geometry, JTS Topology Suite - to implement spatial
indexing/retrieval
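The core operation that FAISS/SPTAG accelerate with approximate indexes is nearest-neighbor search over embeddings; here is a brute-force cosine-similarity sketch with invented vectors:

```python
import math

# Brute-force cosine-similarity retrieval over toy document embeddings.
# FAISS/SPTAG do the same ranking, but over billions of vectors via
# approximate index structures instead of a linear scan.
DOCS = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.7, 0.7, 0.0],
    "doc3": [0.0, 1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def search(query_vec, k=2):
    scored = sorted(((cosine(query_vec, v), d) for d, v in DOCS.items()), reverse=True)
    return [(d, round(s, 3)) for s, d in scored[:k]]

print(search([1.0, 0.1, 0.0]))  # doc1 ranked first, then doc2
```

The <DocID, Score> output shape matches the retrieval contract stated above, regardless of whether the scorer is BM25, a graph traversal, or vector similarity.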
Retrieval
But any real search application will require a lot of optimization on top of the existing
system for index sizes, QPS loads, etc.
Such as sharding (depending on the search application), etc.
Retrieval - open source problems
Both Solr and Elastic are good for limited cluster sizes; there are serious problems
scaling to very large collections
There has been huge progress recently (2010+) in new index compression techniques
that produce a smaller index with fast retrieval - worth tens of millions of dollars in
hardware costs for 'big' search
None of these techniques are available in the Solr/Elastic engines
Default query parsers are quite limited (in 2015 multi-word synonyms were not
supported) - building a search engine requires a lot of work in the internals of
Solr/Elastic
Indexing and Data Acquisition
1. Acquire - get data from external sources (the web, feeds, etc.) into internal
systems. Some data might come in bulk volumes (petabytes per day - web crawl),
some might be frequently updated (100s of millions of updates per day - prices,
availability), etc.
2. Enrich - merge, resolve duplicates, remove noise, normalize, clean, enrich: do
two pages or two items represent the same object? Are the latitude and longitude
consistent with the address? Map full-text descriptions of an item into a set of
attributes with values? Add derived signals (probability to sell, demand,
similarity to other items) to the item, etc.
3. Index - map into a format understandable by the retrieval systems
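A toy sketch of one enrichment check from step 2 ("do two items represent the same object?"), using token Jaccard similarity; the listings and threshold are invented, and production systems combine many such signals in a learned model:

```python
# Toy near-duplicate check via token Jaccard similarity after normalization.
# A real deduper blends many signals (text, images, attributes, location)
# in a trained model; this shows a single text signal.
def normalize(text):
    return set(text.lower().replace(",", " ").split())

def jaccard(a, b):
    ta, tb = normalize(a), normalize(b)
    return len(ta & tb) / len(ta | tb)

item1 = "Samsung 52 inch Smart TV, Alexa built-in"
item2 = "SAMSUNG smart tv 52 inch alexa built-in"
sim = jaccard(item1, item2)
print(round(sim, 2), sim > 0.8)  # high similarity -> likely the same object
```

Normalization does most of the work here: case and punctuation differences vanish before comparison, so the two listings collapse to the same token set.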
Acquisition - open source
Crawling the web: Apache Nutch and Heritrix for scalable crawling, many others for
smaller scale
Apache Atlas for governance and discovery: data registry, data discovery
A lot of domain specific data acquisition tools: ACQ4 for neurophysiology, open source
GISs
Transport: Apache Kafka, Apache Pulsar to bring data from external systems
Document stores, wide column stores: Cassandra/ScyllaDB, HBase, and many others to
store acquired data before indexing - depending on needs, data types, update
frequency
Enrich
Duplicate resolution, cleaning, adding derived data - there is no good open source
for these
A lot of workflow management - see another slide
Spark/Streaming and Flink at the higher level for running jobs, but nothing at the task
level (and the tasks are too domain specific)
Indexing
Indexing is specific to search, and search engines provide their own indexing
Solr and Elastic for text search; Vespa
Facebook FAISS, MS SPTAG for similarity search
Neo4j and other graph systems
Acquisition, Enrichment, Indexing - orchestration
Typically, data pipelines consist of hundreds of individual components
Integrating external data, internal data, derived data; cleaning; enriching
Orchestration becomes very important: multiple jobs written in different languages,
accessing different data stores
Airflow, Oozie, even Luigi - depending on complexity and environment
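The core of what these orchestrators do, running jobs in dependency order, can be sketched with the standard library; the pipeline steps below are invented stand-ins for real jobs:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Invented indexing pipeline: each task maps to the set of tasks it depends
# on -- the shape an Airflow/Oozie/Luigi DAG encodes.
dag = {
    "crawl": set(),
    "dedupe": {"crawl"},
    "enrich": {"dedupe"},
    "build_index": {"enrich"},
    "deploy_index": {"build_index"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['crawl', 'dedupe', 'enrich', 'build_index', 'deploy_index']
```

Real orchestrators add what this sketch omits: scheduling, retries, backfills, and running independent branches in parallel across heterogeneous jobs.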
Other useful stuff to build search systems
Low level: gRPC, protobuf, snappy, gflags, glog, Google Benchmark, etc.: typically
everywhere in a search stack
Google S2 Geometry or JTS Topology Suite - location based search
GoogleTest, xUnit-style frameworks, etc. for testing
Jaeger or Apache SkyWalking for distributed tracing
Ops
Kubernetes
Knative - automated running of serverless containers on K8s with autoscaling and
revision tracking
Bloomberg's Solr Operator - to run Solr on a K8s cluster
Netflix's Chaos Monkey - chaos engineering for a distributed system such as search;
kills instances periodically. A multibillion search engine typically has hundreds of
services running on many nodes; the system must be fault tolerant
Ops
SolrCloud - tools to set up a fault tolerant, highly available Solr cluster
Elasticsearch Cross-Cluster Replication (CCR) - replication of the index across data
centers for Elastic
Terraform - defining and provisioning data centers for search ops
Jenkins - deployment
Other blocks
Experimentation: Facebook PlanOut, Intuit Wasabi, Wix Petri
Graphical log reports: Grafana, Kibana
Logs: ELK stack (Logstash), Filebeat, Fluentd
Conversational Search - task oriented
Dialog management and NLU specific to conversational tasks (slot extraction, intent
classification) - Rasa. You can use Rasa to manage dialog state (it is great at that,
from learning to manual tools for analyzing dialogs), but build your own NLU
Plenty of research open source projects come out of the DSTC competitions, focused
on various problems in building dialog systems: dialog act classification, slot
extraction, dialog breakdown detection
Nvidia NeMo if you need your own Automated Speech Recognition (tuning for a
specific language, domain, business)
Conversational Search - QA over paragraphs
The results of academic research and competitions are frequently available online:
Passage Re-ranking with BERT - dl4marco-bert
TANDA from Alexa
YodaQA
cdQA
DrQA
They require a lot of work to tune for your corpora
NL Search over structured data
SQLNet
Uni of Freiburg Aqqu
Percy Liang’s Dependency Based Compositional Semantics
Mostly academic code
Conclusions
Open source technologies are useful for building many parts of the search engine
stack, from low-level code using libraries (e.g. gRPC) to using whole open source
systems (e.g. Solr/Elastic, Kafka, the TensorFlow runtime)
Each part will require a lot of work to tune it for your environment, to build code on
top of it, and to design its ops
Do not be surprised if you have to rewrite certain open source components in your
implementation as your search engine grows and you need 'extreme' tuning,
performance, and quality
Conclusions
A search engine is the joint work of many, many people
There are many people who know XGBoost, CatBoost, TensorFlow, Solr, Kafka, Kibana
Using open source simplifies hiring for expertise
Search systems require continuous, never-ending evolution: they need to be modified
and rewritten many times (module by module) - open source is a big advantage due
to access to the source and community help (sometimes)
Very few companies can afford search teams of hundreds or thousands of people as in
very big search; even 20+ billion dollar search engines have quite small teams. Open
source helps

From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Recently uploaded (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Building multi-billion (dollars, users, documents) search engines on open source technologies

  • 1. Building multibillion search engines on open source technologies Andrei Lopatenko, PhD Vice President of Engineering, Zillow Group
  • 2. Who am I Core contributor to Google Search (2006-2010), Apple AppStore/iTunes. Co-designed and co-implemented Apple Maps Search (2010), Walmart Grocery. Has been leading search teams: Zillow (now), Walmart, eBay. PhD in Computer Science, The University of Manchester, UK. My path: from core contributor of Google Search to leading the search ecosystems of market leaders in Real Estate (Zillow, Trulia), eCommerce (Walmart, eBay), and digital distribution (Apple). Designing, building, implementing, improving, and running multi-billion search engines for the last 15 years
  • 3. My goal for this talk I want to demonstrate that using open source significantly helps in ● implementation ● continuous improvement, both in search infrastructure and in search quality ● support and operations of search engines for billions of users, billions of documents, (dozens of) billions in revenue/GMV A typical search engine uses hundreds of open source products/libraries; I’ll focus on some that have proven useful
  • 4. How I am going to do it Show a reference architecture of a typical multi-billion search engine Show a typical open-source-based implementation Show limitations of open source implementations (more in Q&A) (My talk is limited to 30 min, so I’ll be brief; the topic deserves a day-long tutorial) The talk will be encyclopedic in style; let’s go deeper in Q&A. I’ll focus more on software used or tried in search engines I’ve built, rather than a comprehensive survey of every open source package available (I might be biased sometimes), and I’ll focus on what is useful for different types of search applications
  • 5. Why a multi-billion search engine? Many billions of what? 1. Multi-billion dollars in revenue/GMV 2. Multi-billion users -> multi-billion queries per day 3. Multi-billion documents How are these numbers relevant to the software architecture?
  • 6. Multi-billion dollars Implies 1. Potential gains of hundreds of millions to billions of dollars per year because of higher conversion due to better ranking, retrieval and query understanding functions, latency improvements, UX 2. Continuous work on search quality, search features, search infrastructure, and ranking improvements to get billion-dollar gains 3. Complexity of the search stack and many of its features due to the complexity of the business it supports
  • 7. Multi-billion users Implies 1. Multi-billion queries per day 2. Tough throughput requirements 3. Distributed deployments for high load
  • 8. Multi-billion documents 1. Distributed retrieval and ranking systems to run retrieval and ranking over billions of documents 2. Complexity of search functions (one needs good ranking if there are too many documents for a query) 3. Frequency of updates (even a small update rate causes frequent updates if there are billions of documents) -> index update latency requirements
  • 9. Are there many search engines based on open source? Google, Amazon, eBay, MS/Bing use their own technologies 1. For all of them, the search engine is the core of the business 2. The complexity of ranking, index etc. is well beyond what open source provides 3. Huge monetary and user gains because of optimizations in the size of the index, frequency of index updates, complexity of ranking/query understanding functions, integration with other systems 4. Each company has a big search team to improve and develop its search engine LinkedIn, eBay, Airbnb moved away from open source due to limitations of open source search engines
  • 10. Are there many search engines based on open source? Walmart eCommerce (Solr, moved to Elastic), Apple Store, iTunes (Solr), Adobe Cloud (Elastic), Uber (Solr), Salesforce (see conferences such as Activate and Elastic{ON} for details on how open source is used in search)
  • 11. Typical Search Engines Simple search: small collection of documents, low sensitivity to user satisfaction -> Search SaaS: Algolia, Google Custom Search, SwiftType (bought by Elastic), Searchify Typical big search: hundred-million/billion users, billions in revenue, billions of documents -> open source based search engines (the focus of this talk) Very big search: billions of users, 100s of billions of documents, high demand for search quality, huge search ops costs -> custom search engines (Google, Amazon etc) Axis: complexity, size, demand for quality, operation cost
  • 12. Historical reasons are important We built custom Apple Search in 2010 (C++); Solr/Lucene was not good enough at the time for the expected number of users (a billion) and for search quality. Now I would take another decision We built a custom graph-based search engine in 2016 (Ozlo, sold to Facebook); open source was not good enough in QPS for the graph query language we needed to build, natural language search, the required frequency of graph updates, etc. Now open source solutions for graph search engines are good for reasonable requirements
  • 13. Typical Search Engine - High Level View Data Acquisition Indexing Ranking Retrieval Query Understanding UX Search Assistance Logging Monitoring Experiment management SERP logic Other
  • 14. Key Main Components Aka Search Quality: search assistance services (autosuggest, dynamic facets etc), query understanding, ranking, SERP logic (snippet building, universal search - mixing results of different corpora) Aka Search Infrastructure: retrieval, logging, monitoring, ops Aka Indexing: data acquisition (crawling the web, feeds, data imports), including data enrichment (duplicate resolution, cleaning data, mapping into common dictionaries, extraction), indexing (building the index for retrieval systems)
  • 15. Query Understanding Task: process a query, parse it, extract information from it, classify it, add information useful for retrieval and ranking Latency requirement: typical limit 10 ms, up to 50 ms Throughput requirement: at least the QPS of the system, billions per day -> 10^5 QPS Frequently distributed: requires loading large vocabularies (language models, embeddings), gigabytes in size; naturally many components for different classification, parsing, and expansion tasks (up to hundreds); heavy CPU and GPU load
  • 16. Query Understanding: Example Query: 52 inches tv samsung alexa Query Understanding: category: TV, size: 52 inch, brand: samsung, additional: alexa Relaxations: size: 48-60 inch, additional: alexa <- optional, brand: sharp, sony
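A step like the one on this slide can be sketched as a tiny rule-based attribute extractor; the vocabularies, attribute names, and parsing rules below are illustrative assumptions, not a production schema.

```python
import re

# Illustrative vocabularies; a real system would use large curated
# dictionaries, embeddings, and learned taggers instead.
BRANDS = {"samsung", "sony", "sharp", "lg"}
CATEGORIES = {"tv": "TV", "laptop": "Laptop"}

def parse_query(query: str) -> dict:
    """Minimal sketch: map tokens to structured query attributes."""
    parsed = {"additional": []}
    tokens = query.lower().split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # "52 inches" style size attribute
        if tok.isdigit() and i + 1 < len(tokens) and tokens[i + 1] in ("inch", "inches"):
            parsed["size"] = f"{tok} inch"
            i += 2
            continue
        if tok in BRANDS:
            parsed["brand"] = tok
        elif tok in CATEGORIES:
            parsed["category"] = CATEGORIES[tok]
        else:
            parsed["additional"].append(tok)  # unmatched tokens kept as-is
        i += 1
    return parsed

print(parse_query("52 inches tv samsung alexa"))
```

Real query understanding layers chain dozens of such components (classifiers, taggers, expanders); the point here is only the input/output shape: free text in, structured attributes out.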
  • 17. Query Understanding - Open Source 1. spaCy 2. fastText 3. Stanford CoreNLP 4. Apache OpenNLP 5. Google SLING Frequently low-performance; one of my groups had to rewrite the internals of the Stanford parser to make it performant
  • 18. Query Understanding - Open Source 1. Hugging Face Transformers 2. Zalando Flair 3. Facebook PyText 4. ULMFiT 5. OpenNMT for machine translation
  • 19. Query Understanding Latency and throughput requirements are tight A lot of caching for hard-to-compute models (the query distribution is skewed) Fast performant models such as fastText-based ones have a big advantage
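Because the query distribution is heavily skewed, even a simple in-process LRU cache absorbs most traffic to an expensive model. A minimal sketch, where `classify_query` is a stand-in for a real (slow) model call and the counter only exists to show the effect:

```python
from functools import lru_cache

CALLS = {"n": 0}  # counts actual model invocations (for illustration)

@lru_cache(maxsize=100_000)
def classify_query(query: str) -> str:
    """Stand-in for an expensive query-classification model."""
    CALLS["n"] += 1
    return "product" if "tv" in query.split() else "other"

# Head queries repeat: four requests, but the model runs only twice.
for q in ["samsung tv", "samsung tv", "red shoes", "samsung tv"]:
    classify_query(q)

print(CALLS["n"])
```

In production this would typically be a shared cache (e.g. Redis) keyed by a normalized query, with TTLs tied to model retraining cadence.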
  • 20. Query Understanding - Open Source - Tools Doccano - annotation tool, to create training sets AllenNLP - as an NLP RankLab, to test models Snorkel MeTaL - great weak-supervision tools; weak supervision is used for many practical NLP tasks
  • 21. Search Assistance Autosuggest / type-ahead: Solr has Suggester, Elastic has similar features too But building custom autosuggest on top of them is a huge improvement 1. Better language modeling 2. Context (user, location) 3. Improved retrieval (substring, not prefix-only; spell correction in autosuggest) All are big improvements in quality/user satisfaction The difference is big: 30%+ engagement for eCommerce
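The "substring, not prefix-only" point can be sketched in a few lines: prefix matching misses "sectional sofa" for the input "sofa", while substring matching finds it. The candidate set and popularity scores below are toy assumptions standing in for real language-model scoring.

```python
# Illustrative suggestion candidates with popularity scores.
SUGGESTIONS = {
    "sectional sofa": 90,
    "sofa bed": 80,
    "soda maker": 40,
}

def suggest(typed: str, k: int = 5) -> list:
    """Substring-based suggester sketch, ranked by popularity."""
    typed = typed.lower()
    hits = [(s, score) for s, score in SUGGESTIONS.items() if typed in s]
    hits.sort(key=lambda pair: -pair[1])  # most popular first
    return [s for s, _ in hits[:k]]

print(suggest("sofa"))
```

A production suggester would index candidates in an FST or n-gram index for speed and blend in user/location context, but the matching semantics are the same.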
  • 22. Search Assistance Filters - Facets Both Solr and Elastic generate facets But usually you need to build your own faceting on top of them to generate facets that are good for users The difference might be huge: 15%+ conversion for eCommerce
  • 23. Ranking - what is it? Given 1. a query with all information derived about the query 2. a user with all information known about the user 3. a set of retrieved results Produce a ranking of results optimized for highest success Where success might be click-through rate, conversion rate, GMV, revenue, or other engagement, satisfaction and monetization metrics Latency requirement: typical limit 10 ms, up to 50 ms Throughput requirement: at least the QPS of the system, billions per day -> 10^5 QPS
  • 24. Ranking - technologies - LeToR LeToR (learning to rank) / MLR (machine-learned ranking) functions have proven highly successful for ranking 1. Complexity of many query/user/document features to be used for ranking 2. Complexity of ranking functions 3. Tuning for various metrics / re-tuning as metrics change May require intervention (legal regulations of search engines, multi-metric optimization, internal policies not expressible in learnable functions)
  • 25. Ranking - Open Source - Training Popular approach: a proprietary LeToR ‘RankLab’ based on an LTR type such as GBDT or SVM-rank, using open source such as XGBoost, CatBoost, SVMlight to learn the ranking function Converting the model generated by the previous step into the source code of your ranking engine (Java, C++) Compiling and deploying the ranking function as a part of the ranking layer
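The "convert the model into source code" step can be sketched with a toy example: a single decision stump (standing in for a GBDT ensemble learned with XGBoost/CatBoost) is emitted as Java-like source that the ranking layer would then compile. The feature name, threshold, and emitted signature are illustrative assumptions.

```python
# Toy learned model: one decision stump over a single feature.
# A real RankLab would walk hundreds of trees from a serialized ensemble.
stump = {"feature": "bm25", "threshold": 7.5, "left": 0.1, "right": 0.9}

def emit_java(tree: dict) -> str:
    """Emit the stump as Java-like source for the ranking engine."""
    return (
        f"double score(double {tree['feature']}) {{\n"
        f"  return {tree['feature']} < {tree['threshold']}"
        f" ? {tree['left']} : {tree['right']};\n"
        f"}}"
    )

print(emit_java(stump))
```

Compiling the model into native code (instead of interpreting it) is what makes the 10-50 ms ranking latency budget feasible for large ensembles.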
  • 26. Ranking - Open Source - Training 1. Solr LTR contrib module - integrated in Solr 2. Google TF-Ranking 3. RankLib
  • 27. Ranking - Feature Storage Some ranking features are in the index, many are not (due to storage limits, performance costs) Features must be retrieved during the ranking stage, for every query, for ~1000+ documents during the 2nd-3rd stages of ranking, hundreds of features -> feature storage Feast, CouchDB, Redis, etc. Wide-column stores, document stores Optimized for multi-id lookups
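The key access pattern is one batched multi-id lookup per ranking request rather than ~1000 round trips. A minimal sketch, with a plain dict standing in for Redis (`MGET`) or a wide-column store; the key format and feature names are assumptions.

```python
# Stand-in for an external feature store keyed by document id.
FEATURE_STORE = {
    "doc:1": {"ctr": 0.12, "price": 499.0},
    "doc:2": {"ctr": 0.05, "price": 299.0},
}

def mget_features(doc_ids):
    """One batched lookup for all candidates in a ranking request.

    Missing documents get an empty feature dict so the ranker can
    fall back to defaults instead of failing.
    """
    return [FEATURE_STORE.get(f"doc:{d}", {}) for d in doc_ids]

print(mget_features([1, 2, 3]))
```

With Redis this would be a single `MGET`/pipeline call; the latency budget of the ranking stage is what forces the batched shape.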
  • 28. Post-ranking feature layer Many search systems require a post-ranking layer to re-arrange results, remove results, or change ranking based on secondary criteria: current price which changes frequently, availability in store/warehouse which changes frequently, optimizing cross-category selling, etc. Non-persistent features delivered by external systems (warehouse management, sales management) Key-value stores optimized for read-write: Redis etc
  • 29. Caching High-load systems: caching on top of the search engine keeps results of frequent queries to reduce load on the system and to give low latency for cached queries (no need to recompute the result set). Caches are implemented in Solr and Elastic Frequently, a cache is needed on top of the search engine (all layers of ranking after L0 - personalization, diversification etc; the merged output of multiple Solr/Elastic search engines to be cached, aka universal search) - Redis, Memcached
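A cache above the engines also needs a freshness bound, since index updates would otherwise be invisible for cached queries. A minimal TTL-cache sketch; the `backend` callable, 60-second TTL, and in-process dict are illustrative stand-ins for a Redis/Memcached tier in front of the merged SERP.

```python
import time

CACHE = {}          # query -> (timestamp, results); stand-in for Redis
TTL_SECONDS = 60.0  # illustrative freshness bound

def cached_search(query, backend, now=time.time):
    """Return cached results if fresh, else call the engines and cache."""
    entry = CACHE.get(query)
    if entry and now() - entry[0] < TTL_SECONDS:
        return entry[1]                # cache hit: engines not touched
    results = backend(query)           # full retrieval + ranking path
    CACHE[query] = (now(), results)
    return results

hits = cached_search("tv", lambda q: ["doc1", "doc2"])
print(hits)
```

Because the query distribution is skewed, a modest TTL cache like this can shed a large fraction of load while bounding staleness to the TTL.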
  • 30. ML and NLP training / Rank Lab / ML tools Mostly for ranking and query understanding, but used for many other things too: duplicate resolution, anomaly detection Spark Platforms gaining maturity and popularity: Kubeflow, MLflow PyTorch, TensorFlow - and building your ML stack on top of them XGBoost, CatBoost, etc (this slide deserves a separate 1-hour talk, as do all the others)
  • 31. ML and NLP training / Rank Lab / ML tools R/CRAN, despite its limits, is quite useful for many search ML tasks at the exploration stage, and some teams still use it for production learning of LeToR models My experience: R time-series modules to learn time dependencies for the ranking function (for queries sensitive to fast topicality shifts), R stat packages to learn language affinity features and other ranking tasks Analysis of A/B tests, contextual bandits and other types of experiments - typically Python code built on top of scikit-learn, pandas, numpy
  • 32. Retrieval Given a query Q, return an ordered list <DocID, Score> Lucene ecosystem: Solr + Elastic for text documents, text scoring, multiple fields, filters, sorts, spatial queries; Yahoo Vespa Geographic: Solr Spatial Graphs: Neo4j, Apache Giraph, TitanDB Vector search: Facebook FAISS, MS SPTAG Google S2 geometry, JTS Topology Suite library - to implement spatial indexing/retrieval
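What FAISS/SPTAG accelerate at billion-document scale is, conceptually, nearest-neighbor search over document vectors. A brute-force cosine-similarity sketch makes the contract concrete; the toy 2-d vectors are assumptions, and real systems replace the exhaustive scan with approximate indexes.

```python
import numpy as np

# Toy document embedding matrix; in practice millions-billions of
# higher-dimensional vectors stored in an ANN index.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])

def top_k(query_vec, k=2):
    """Exhaustive cosine-similarity retrieval: (doc_id, score) pairs."""
    q = query_vec / np.linalg.norm(query_vec)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

print(top_k(np.array([1.0, 0.1])))
```

Libraries like FAISS keep exactly this interface (vectors in, top-k ids and scores out) while replacing the O(N) scan with quantized or graph-based indexes.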
  • 33. Retrieval But any real search application will require a lot of optimization on top of the existing system for index sizes, QPS loads, etc., such as sharding (depends on the search application)
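One common sharding scheme routes each document to a shard by a stable hash of its id, so any node can compute the route without coordination. A minimal sketch; the shard count and id format are illustrative assumptions (real deployments also handle resharding, replicas, and hot shards).

```python
import hashlib

NUM_SHARDS = 64  # illustrative; chosen from index size and QPS targets

def shard_for(doc_id: str) -> int:
    """Deterministically map a document id to a shard number."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("listing-12345"))
```

At query time the broker fans the query out to all shards and merges the per-shard top-k lists, which is where much of the "optimization on top of the existing system" work lands.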
  • 34. Retrieval - open source problems Both Solr and Elastic are good for limited cluster sizes; serious problems scaling to very large collections There has been huge progress recently (2010+) in new types of index compression techniques which yield smaller indexes with fast retrieval - worth dozens of millions of dollars in hardware costs for ‘big’ search None of them are available in the Solr/Elastic engines Default query parsers are quite limited (in 2015 multi-word synonyms were not supported) - making search engines requires a lot of work in the internals of Solr/Elastic
  • 35. Indexing and Data Acquisition 1. Acquire - get data from external sources (the web, feeds, etc.) into internal systems. Some data might be in bulk volumes (petabytes per day - web crawl), some might be frequently updated (100s M updates per day - prices, availability) 2. Enrich - merge, resolve duplicates, remove noise, normalize, clean, enrich - do two pages or two items represent the same object? Are latitude and longitude consistent with the address? Map full-text descriptions of an item into a set of attributes with values? Add derived signals (probability to sell, demand, similarity to other items) to the item 3. Index - map into a format understandable by the retrieval systems
  • 36. Acquisition - open source Crawling the web: Apache Nutch and Heritrix for scalable crawling, many others for smaller scale Apache Atlas for governance and discovery: data registry, data discovery A lot of domain-specific data acquisition tools: ACQ4 for neurophysiology, open source GISs Transport: Apache Kafka, Apache Pulsar to bring data from external systems Document stores, wide-column stores: Cassandra/ScyllaDB, HBase and many others to store acquired data before indexing - depending on needs, data types, update frequency
  • 37. Enrich Duplicate resolution, cleaning, adding derived data - there is no good open source for it A lot of workflow management, see another slide Spark/Streaming and Flink at a higher level for running jobs, but nothing at the task level (and tasks are too domain-specific)
  • 38. Indexing Specific to search, and search engines provide their own indexing Solr and Elastic for text search; Vespa, Facebook FAISS, MS SPTAG for similarity search Neo4j and other graph systems
  • 39. Acquisition, Enrichment, Indexing - orchestration Typically, data pipelines of hundreds of individual components Integrating external data, internal data, derived data; cleaning, enriching Orchestration becomes very important: multiple jobs written in different languages, accessing different data stores Airflow, Oozie, even Luigi - depending on complexity, environment
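What an orchestrator like Airflow, Oozie, or Luigi fundamentally provides is a dependency-respecting execution order over such a pipeline. That core idea fits in a few stdlib lines; the task names below are illustrative, and the dict maps each task to the tasks it depends on.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Illustrative acquire -> enrich -> dedupe -> index pipeline,
# expressed as task -> {tasks it depends on}.
deps = {
    "enrich": {"acquire_feeds", "crawl"},
    "dedupe": {"enrich"},
    "index": {"dedupe"},
}

# A valid execution order: every task appears after its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Real orchestrators add scheduling, retries, backfills, and heterogeneous executors on top of this ordering, which is why they matter once the pipeline grows to hundreds of components.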
  • 40. Other useful stuff to build search systems Low level: gRPC, protobuf, snappy, gflags, glog, google benchmark, etc. - typically everywhere in the search stack Google S2 geometry or JTS Topology Suite - location-based search GTest, xUnit-style frameworks, etc. for testing Jaeger or Apache SkyWalking for distributed tracing
  • 41. Ops Kubernetes Knative - automated running of serverless containers on k8s with autoscaling, revision tracking Bloomberg Solr operator - to run Solr on a k8s cluster Netflix’s Chaos Monkey - chaos engineering for a distributed system such as search: kill instances periodically Multi-billion search is typically hundreds of services running on many nodes; the system must be fault tolerant
  • 42. Ops SolrCloud - tools to set up a fault-tolerant, highly available Solr cluster Elasticsearch Cross-Cluster Replication (CCR) - replicating the index across data centers for Elastic Terraform - defining and provisioning data centers for search ops Jenkins - deployment
  • 43. Other blocks Experimentation: Facebook PlanOut, Intuit Wasabi, Wix Petri Graphical / log reports: Grafana, Kibana Logs: ELK stack, Logstash, Filebeat, Fluentd
  • 44. Conversational Search Task-oriented dialog management, NLU specific to conversational tasks, slot extraction, intent classification - Rasa. You can use Rasa to manage dialog state (it is great at that, from learning to manual tools to analyze dialogs) but build your own NLU Plenty of research open source projects as outcomes of the DSTC competitions, focused on various problems in building dialog systems: dialog act classification, slot extraction, dialog breakdown detection NVIDIA NeMo if you need your own automatic speech recognition (tuning for a specific language, domain, business)
  • 45. Conversational Search QA over paragraphs Results of academic research and competitions are frequently available online Passage re-ranking with BERT - dl4marco-bert TANDA from Alexa YodaQA cdQA DrQA Requires a lot of work to tune for your corpora
  • 46. NL Search over structured data SQLNet Uni of Freiburg Aqqu Percy Liang’s Dependency-Based Compositional Semantics Mostly academic code
  • 47. Conclusions Open source technologies are useful to build many parts of the search engine stack, from low-level code using libraries (e.g. gRPC) to using open source systems, e.g. Solr/Elastic, Kafka, TensorFlow runtime Each part will require a lot of work to tune it for your environment, code on top of it, and design ops Do not be surprised if you have to rewrite certain open source components with your own implementation as your search engine grows and you need ‘extreme’ tuning, performance, quality
  • 48. Conclusions A search engine is the joint work of many, many people There are many people who know XGBoost, CatBoost, TensorFlow, Solr, Kafka, Kibana - using open source simplifies hiring for expertise Search systems require continuous, never-ending evolution: they need to be modified and rewritten many times (module by module) - open source is a big advantage due to access to the source and community help (sometimes) Very few companies can afford search teams of hundreds/thousands of people as in very big search; even 20+ billion dollar search engine teams are quite small. Open source helps