SlideShare a Scribd company logo
1 of 24
Download to read offline
Search engines in the
industry
a use case
Different interests
● researchers / engineers look for high
precision and recall
● editors / writers are concerned about
matching of queries and results
● marketers want to change / adapt results
Designing a search engine
● functional requirements
○ search
■ keywords, boolean retrieval, natural language
○ indexing
■ data sources
■ data types
○ administration
■ manage scoring / boosting functions
Designing a search engine
● architectural requirements
○ resiliency
○ scalability
○ no downtime
○ work with existing infrastructure
○ platforms
○ migrating from legacy systems
○ talk to other systems
Designing a search engine
● performance requirements
○ search
■ query per second
■ time per search request
○ index
■ document per second
■ time per indexing request
○ SLA?
Designing a search engine
● search engine performance requirements
○ recall percentiles threshold
○ precision percentiles threshold
○ minimize empty results
● often mostly unknown
○ published vs unpublished / to be written documents
● almost always umanageable
○ cannot decide when
■ it’ll be ready
■ it’ll have to be indexed
■ it’ll have to be searchable
● heterogeneous
○ different writers, languages, topics, styles, etc.
Data
Process
Project
● ~50M heterogeneous documents
● Migrating from old commercial solution to
Apache Solr
● Google like search
● Targeted search for different types of
contents
Advanced capabilities
● Smart understanding of queries
● Smart suggestion of queries
● Suggestion of similar / important contents
● Automatic classification of contents
Responsibilities
● architecture analysis and design
○ scaling under high load
● continuous definition of algorithms for
indexing and searching
● system maintenance
Skills required
● basics of information retrieval
● a bit of distributed systems
● some natural language processing
● some machine learning
Architecture analysis and design
● Shape up a prototype architecture
○ separate machines for indexing and search
○ multiple load balanced machines for searching
○ define indexing and search algorithms
● Evaluate architecture
○ stress tests (performance)
○ quality tests (accuracy)
● Iterate
Architecture analysis and design
● analyze existing documents
○ avg size
○ language
○ topics, style, etc.
● analyze existing query logs
○ avg response time
○ avg length (how much it takes to specify a query?)
○ avg query per second
Most time spent on
● testing how documents get indexed
● testing how user queries get transformer in
platform specific queries
● tweaking indexing algorithms
● tweaking search algorithms
● tweaking ranking
● platform optimization for scalability
Challenges
● Architecture constraints
● Performance
● Diverging stakeholders concerns
● Dynamically scaling search
Sample architecture constraint #1
● Data storage has to be on NFS
● Lucene is IO intensive
● NFS makes it slower
● Concurrent read writes makes it error prone
Sample architecture constraint #2
● Change search engine
● Systems talking to the SE need to switch
API
● Only in the long run
● In the short run an adapter layer for old APIs
on new APIs has to be developed
Indexing performance
● Most of the indexing time is spent converting
data from the old (indxing) format to the new
(indexing) format
● The adaption layer between old and new API
becomes the bottleneck
● Time to switch to the new API natively
Diverging concerns
● Article authors check the search engine
exactly handles their writings wanting perfect
recall and precision
○ so lot of time is spent on adjusting ranking
● Markters want to be able to overcome
ranking and put something they want to sell
○ ranking algorithm gets breached
● Need flexible algorithms
Scale dinamically
● Search engine needs not to break even
under high peaks of load
● Such peaks are often unpredictable
● Need a fast way to add more computing
power
Takeaways
● small iterations (no waterfalls!)
○ analyze portion of data / queries
○ change search / index algorithms
○ test, involve stakeholders
○ forces ability to reindex quickly
● look at data (documents, query logs)

More Related Content

Similar to Search engines in the industry

Text search with Elasticsearch on AWS
Text search with Elasticsearch on AWSText search with Elasticsearch on AWS
Text search with Elasticsearch on AWSŁukasz Przybyłek
 
Engage 2019: Modernising Your Domino and XPages Applications
Engage 2019: Modernising Your Domino and XPages Applications Engage 2019: Modernising Your Domino and XPages Applications
Engage 2019: Modernising Your Domino and XPages Applications Paul Withers
 
Journey and evolution of Presto@Grab
Journey and evolution of Presto@GrabJourney and evolution of Presto@Grab
Journey and evolution of Presto@GrabShubham Tagra
 
OutSystems Tips and Tricks
OutSystems Tips and TricksOutSystems Tips and Tricks
OutSystems Tips and TricksOutSystems
 
Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...
Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...
Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...Data Con LA
 
Designing a generic Python Search Engine API - BarCampLondon 8
Designing a generic Python Search Engine API - BarCampLondon 8Designing a generic Python Search Engine API - BarCampLondon 8
Designing a generic Python Search Engine API - BarCampLondon 8Richard Boulton
 
Got documents - The Raven Bouns Edition
Got documents - The Raven Bouns EditionGot documents - The Raven Bouns Edition
Got documents - The Raven Bouns EditionMaggie Pint
 
PostgreSQL, Extensible to the Nth Degree: Functions, Languages, Types, Rules,...
PostgreSQL, Extensible to the Nth Degree: Functions, Languages, Types, Rules,...PostgreSQL, Extensible to the Nth Degree: Functions, Languages, Types, Rules,...
PostgreSQL, Extensible to the Nth Degree: Functions, Languages, Types, Rules,...Command Prompt., Inc
 
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...Amazon Web Services
 
Info 2402 irt-chapter_2
Info 2402 irt-chapter_2Info 2402 irt-chapter_2
Info 2402 irt-chapter_2Shahriar Rafee
 
A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017
A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017
A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017DevOpsDays Tel Aviv
 
The Professional Programmer
The Professional ProgrammerThe Professional Programmer
The Professional ProgrammerDave Cross
 
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)Gabriele Bartolini
 
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA
 
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
(ATS6-PLAT02) Accelrys Catalog and Protocol ValidationBIOVIA
 
Open source ml systems that need to be built
Open source ml systems that need to be builtOpen source ml systems that need to be built
Open source ml systems that need to be builtNikhil Garg
 
How to Choose the Right Database for Your Workloads
How to Choose the Right Database for Your WorkloadsHow to Choose the Right Database for Your Workloads
How to Choose the Right Database for Your WorkloadsInfluxData
 

Similar to Search engines in the industry (20)

Text search with Elasticsearch on AWS
Text search with Elasticsearch on AWSText search with Elasticsearch on AWS
Text search with Elasticsearch on AWS
 
Engage 2019: Modernising Your Domino and XPages Applications
Engage 2019: Modernising Your Domino and XPages Applications Engage 2019: Modernising Your Domino and XPages Applications
Engage 2019: Modernising Your Domino and XPages Applications
 
Journey and evolution of Presto@Grab
Journey and evolution of Presto@GrabJourney and evolution of Presto@Grab
Journey and evolution of Presto@Grab
 
OutSystems Tips and Tricks
OutSystems Tips and TricksOutSystems Tips and Tricks
OutSystems Tips and Tricks
 
Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...
Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...
Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...
 
Friday 1484
Friday 1484Friday 1484
Friday 1484
 
Digital Marketing
Digital MarketingDigital Marketing
Digital Marketing
 
digital marketing
digital marketingdigital marketing
digital marketing
 
Designing a generic Python Search Engine API - BarCampLondon 8
Designing a generic Python Search Engine API - BarCampLondon 8Designing a generic Python Search Engine API - BarCampLondon 8
Designing a generic Python Search Engine API - BarCampLondon 8
 
Got documents - The Raven Bouns Edition
Got documents - The Raven Bouns EditionGot documents - The Raven Bouns Edition
Got documents - The Raven Bouns Edition
 
PostgreSQL, Extensible to the Nth Degree: Functions, Languages, Types, Rules,...
PostgreSQL, Extensible to the Nth Degree: Functions, Languages, Types, Rules,...PostgreSQL, Extensible to the Nth Degree: Functions, Languages, Types, Rules,...
PostgreSQL, Extensible to the Nth Degree: Functions, Languages, Types, Rules,...
 
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
 
Info 2402 irt-chapter_2
Info 2402 irt-chapter_2Info 2402 irt-chapter_2
Info 2402 irt-chapter_2
 
A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017
A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017
A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017
 
The Professional Programmer
The Professional ProgrammerThe Professional Programmer
The Professional Programmer
 
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
 
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
 
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
 
Open source ml systems that need to be built
Open source ml systems that need to be builtOpen source ml systems that need to be built
Open source ml systems that need to be built
 
How to Choose the Right Database for Your Workloads
How to Choose the Right Database for Your WorkloadsHow to Choose the Right Database for Your Workloads
How to Choose the Right Database for Your Workloads
 

More from Tommaso Teofili

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRTommaso Teofili
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakTommaso Teofili
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in SlingTommaso Teofili
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr Tommaso Teofili
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and SolrTommaso Teofili
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache HamaTommaso Teofili
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiTommaso Teofili
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaTommaso Teofili
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in SolrTommaso Teofili
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on codeTommaso Teofili
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA IntroductionTommaso Teofili
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU TourTommaso Teofili
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesTommaso Teofili
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationTommaso Teofili
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the WebTommaso Teofili
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic SearchTommaso Teofili
 

More from Tommaso Teofili (19)

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IR
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit Oak
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in Sling
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache Hama
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGi
 
Oak / Solr integration
Oak / Solr integrationOak / Solr integration
Oak / Solr integration
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and Clerezza
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on code
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA Introduction
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU Tour
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - Usecases
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata Generation
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the Web
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic Search
 

Recently uploaded

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 

Recently uploaded (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 

Search engines in the industry

  • 1. Search engines in the industry a use case
  • 2.
  • 3. Different interests ● researchers / engineers look for high precision and recall ● editors / writers are concerned about matching of queries and results ● marketers want to change / adapt results
  • 4. Designing a search engine ● functional requirements ○ search ■ keywords, boolean retrieval, natural language ○ indexing ■ data sources ■ data types ○ administration ■ manage scoring / boosting functions
  • 5. Designing a search engine ● architectural requirements ○ resiliency ○ scalability ○ no downtime ○ work with existing infrastructure ○ platforms ○ migrating from legacy systems ○ talk to other systems
  • 6. Designing a search engine ● performance requirements ○ search ■ query per second ■ time per search request ○ index ■ document per second ■ time per indexing request ○ SLA?
  • 7. Designing a search engine ● search engine performance requirements ○ recall percentiles threshold ○ precision percentiles threshold ○ minimize empty results
  • 8. ● often mostly unknown ○ published vs unpublished / to be written documents ● almost always umanageable ○ cannot decide when ■ it’ll be ready ■ it’ll have to be indexed ■ it’ll have to be searchable ● heterogeneous ○ different writers, languages, topics, styles, etc. Data
  • 10. Project ● ~50M heterogeneous documents ● Migrating from old commercial solution to Apache Solr ● Google like search ● Targeted search for different types of contents
  • 11. Advanced capabilities ● Smart understanding of queries ● Smart suggestion of queries ● Suggestion of similar / important contents ● Automatic classification of contents
  • 12. Responsibilities ● architecture analysis and design ○ scaling under high load ● continuous definition of algorithms for indexing and searching ● system maintenance
  • 13. Skills required ● basics of information retrieval ● a bit of distributed systems ● some natural language processing ● some machine learning
  • 14. Architecture analysis and design ● Shape up a prototype architecture ○ separate machines for indexing and search ○ multiple load balanced machines for searching ○ define indexing and search algorithms ● Evaluate architecture ○ stress tests (performance) ○ quality tests (accuracy) ● Iterate
  • 15. Architecture analysis and design ● analyze existing documents ○ avg size ○ language ○ topics, style, etc. ● analyze existing query logs ○ avg response time ○ avg length (how much it takes to specify a query?) ○ avg query per second
  • 16. Most time spent on ● testing how documents get indexed ● testing how user queries get transformer in platform specific queries ● tweaking indexing algorithms ● tweaking search algorithms ● tweaking ranking ● platform optimization for scalability
  • 17. Challenges ● Architecture constraints ● Performance ● Diverging stakeholders concerns ● Dynamically scaling search
  • 18. Sample architecture constraint #1 ● Data storage has to be on NFS ● Lucene is IO intensive ● NFS makes it slower ● Concurrent read writes makes it error prone
  • 19. Sample architecture constraint #2 ● Change search engine ● Systems talking to the SE need to switch API ● Only in the long run ● In the short run an adapter layer for old APIs on new APIs has to be developed
  • 20. Indexing performance ● Most of the indexing time is spent converting data from the old (indxing) format to the new (indexing) format ● The adaption layer between old and new API becomes the bottleneck ● Time to switch to the new API natively
  • 21. Diverging concerns ● Article authors check the search engine exactly handles their writings wanting perfect recall and precision ○ so lot of time is spent on adjusting ranking ● Markters want to be able to overcome ranking and put something they want to sell ○ ranking algorithm gets breached ● Need flexible algorithms
  • 22. Scale dinamically ● Search engine needs not to break even under high peaks of load ● Such peaks are often unpredictable ● Need a fast way to add more computing power
  • 23.
  • 24. Takeaways ● small iterations (no waterfalls!) ○ analyze portion of data / queries ○ change search / index algorithms ○ test, involve stakeholders ○ forces ability to reindex quickly ● look at data (documents, query logs)