Google Search vs. Solr Search for Enterprise Search
1. Presented by
Veera Shekar G
Google Search vs. Advanced Search (Enterprise Search implementation)
8/6/2015
11/05/2015
2. • How a normal search engine processes content and queries.
• You will understand how a search engine works.
• I am a beginner at this subject.
• Top 5 requirements for an effective enterprise search implementation.
• Problems with implementations.
Introduction
3. • Topic 1: How a search engine works.
▫ We will look at the architecture and component details.
• Topic 2: Google Search.
▫ Phases of implementation and indexing architecture.
• Topic 3: Top 5 requirements for implementing enterprise search.
▫ Options available for implementation.
Session Outline
4. • A normal search engine architecture.
• Factors that determine a search engine's architecture.
• The indexing process.
Topic 1: Objectives
5. • The architecture of a search engine can be viewed as two layers.
Topic 1: Content – Normal Search engine Architecture
6. • The architecture of a search engine is determined by two requirements:
▫ Effectiveness (quality of results).
▫ Efficiency (response time and throughput).
Topic 1: Content - Factors
7. • Text acquisition – identifies and stores documents for indexing.
• Text transformation – transforms documents into index terms or features.
• Index creation – takes index terms and creates data structures (indexes) to support fast searching.
Topic 1: Content
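The three components above can be sketched as a tiny indexing pipeline (a simplified illustration in Python; real engines add tokenization, stemming, stopword removal, and compressed index structures):

```python
from collections import defaultdict

def transform(text):
    # Text transformation: lowercase the text and split it into index terms.
    return text.lower().split()

def build_index(docs):
    # Index creation: map each term to the set of document IDs containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in transform(text):
            index[term].add(doc_id)
    return index

# Text acquisition is simulated here by an in-memory dict of documents.
docs = {1: "Enterprise search with Solr", 2: "Google enterprise search appliance"}
index = build_index(docs)
print(sorted(index["search"]))  # [1, 2]
```

Looking up a term in the resulting inverted index is a dictionary access, which is what makes searching fast.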
8. • A search engine has two main processes: an indexing process and a query process.
• Questions?
Topic 1: Wrap-up
9. • High-level architecture of Google Search.
• Web crawlers.
• Technologies used.
Topic 2: Google Search
11. • A web crawler is a program that, given one or more seed URLs, downloads
the web pages associated with those URLs and extracts any hyperlinks
contained in them.
• It then recursively downloads the web pages identified by those
hyperlinks. Web crawlers are an important component of web search
engines, where they are used to collect the corpus of web pages
indexed by the search engine.
Topic 2: Content - Web Crawlers
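The crawling loop described above can be sketched as follows (a minimal illustration; the in-memory `site` dict stands in for real HTTP fetching, and a production crawler would add politeness delays, robots.txt handling, and URL normalization):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Pulls the href value out of every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, fetch, max_pages=100):
    # Breadth-first crawl: download each page, extract its hyperlinks,
    # and enqueue every URL not seen before.
    seen, queue, corpus = set(seed_urls), list(seed_urls), {}
    while queue and len(corpus) < max_pages:
        url = queue.pop(0)
        corpus[url] = fetch(url)  # in practice, an HTTP GET of the page
        parser = LinkExtractor()
        parser.feed(corpus[url])
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus

# A tiny in-memory "web" stands in for real HTTP fetching:
site = {"a": '<a href="b">next</a>', "b": "<p>leaf page</p>"}
corpus = crawl(["a"], site.__getitem__)
print(sorted(corpus))  # ['a', 'b']
```

The collected `corpus` is exactly what the indexing process consumes.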
12. • Google visualizes their infrastructure as a three-layer stack:
• Products: search, advertising, email, maps, video, chat, Blogger.
• Distributed systems infrastructure: GFS, MapReduce, and BigTable.
• Computing platforms: a bunch of machines in a bunch of different data
centers.
• The goal is to make it easy for folks in the company to deploy applications at a low cost.
• They look at price/performance data on a per-application basis: spend more
money on hardware to avoid losing log data, but spend less on other types
of data. That said, they don't lose data.
Topic 2: Content – Technology Stack
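MapReduce, one layer of that stack, can be illustrated with the classic word-count example (a toy single-process sketch, not Google's distributed implementation):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Map: emit a (term, 1) pair for every term in the document.
    return [(term, 1) for term in doc.lower().split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by term and sum the counts.
    counts = defaultdict(int)
    for term, count in pairs:
        counts[term] += count
    return dict(counts)

docs = ["Google search", "enterprise search with Solr"]
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
print(counts["search"])  # 2
```

In the real system, the map calls run in parallel across many machines and the shuffle moves pairs over the network; the programming model stays this simple.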
14. • Top 5 requirements for implementing enterprise search.
• Options available for each requirement.
Topic 3: Objectives
15. • Diverse content: ability to crawl, index, and search diverse content repositories.
▫ The Web, Microsoft SQL databases, and SharePoint content management systems.
• Secured search: ability to crawl secured content and make it accessible only to authorized people and/or groups.
▫ Single sign-on, forms-based authentication.
• User interface: ability to provide various user interface (UI) components to serve end users with precise results.
▫ Guided navigation, related search terms, related articles, and best bets.
▫ Autosuggest with terms combined from real-time search and custom (user-configurable) terms in data stores.
• Desktop search: ability to integrate with content stored on the desktop.
• Social search: ability to find other people, ratings, and expertise within the organization.
Topic 3: Content – Top 5 requirements for implementing enterprise search
16. • Google web crawler for crawling and indexing web content (GOOTB).
• Google DB connector for crawling and indexing Microsoft SQL databases (GOOTB).
• Google SharePoint connector for crawling and indexing SharePoint content (GOOTB).
• Google forms authentication for index-time authorization and serve-time authentication
(GOOTB).
• Google front-end configuration for:
> Faceted search, aka guided navigation (limited OOTB).
> Related search terms (GOOTB).
> Related articles (GOOTB).
> Best bets (GOOTB).
> Autosuggest (GOOTB and custom application).
• Google desktop search component integration (external Google component).
• Google results integration with an internal rating system.
Topic 3: Content – Google implementation of the requirements
18. • Google web crawler.
• Disadvantage: as efficient and good as it sounds, one disadvantage of the
web crawler is Google's inability to reveal the exact page that is
currently being processed.
• Alternative: the OS console monitor and/or tracking log files are some
ways that could help track URL crawl status.
• At any point in time, a developer should be able to view the current URL
being crawled and any security issues encountered. Almost all tools
provide this feature – such as Solr, FAST, Endeca, and Autonomy.
Topic 3: Content – Web crawler
19. • Database connector.
• Disadvantages:
Google's inability to let end implementers schedule DB crawls.
Poor diagnostics for connector/XML-fed content.
Google's way of removing content from the index is quite primitive and time-consuming.
• Alternative: compared to the GSA, we found Apache Solr to be a better
option for indexing databases via its data import handler.
• Solr provides an effective way to remove content from the index, either via
the admin console or via XML import (/update with the delete option).
Topic 3: Content – Database Connector
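Solr's delete-by-query update, mentioned above, might look like this (the core URL and the `source` field name are assumptions for illustration):

```python
import json

def delete_payload(query):
    # JSON body for Solr's /update handler: delete every document
    # matching the query (e.g. all rows imported from one database).
    return json.dumps({"delete": {"query": query}})

# POST the body to http://localhost:8983/solr/<core>/update?commit=true
body = delete_payload("source:legacy_db")
print(body)  # {"delete": {"query": "source:legacy_db"}}
```

The same handler also accepts delete-by-ID, so individual rows can be removed as the database changes.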
20. • Google provides out-of-the-box connectors to very few CMS systems.
• Disadvantage:
Even if Google executes a bulk late binding, performance issues
at query time are inevitable when the document volume is high.
• Alternative: one alternative is to treat site/page/document-level
security as additional metadata and develop an application that
post-filters the results based on the end user's security attributes. This is again
a primitive method and has its own disadvantages in terms of query-time latency.
Topic 3: Content – SharePoint Connector (for Document Management systems)
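The post-filtering alternative described above can be sketched as follows (URLs, the `acl` field name, and the group values are all hypothetical):

```python
def post_filter(results, user_groups):
    # Late-binding post-filter: keep only results whose ACL metadata
    # intersects the querying user's security groups.
    return [r for r in results if set(r["acl"]) & set(user_groups)]

# Hypothetical results carrying document-level security as metadata:
results = [
    {"url": "/public/faq", "acl": ["everyone"]},
    {"url": "/hr/salaries", "acl": ["hr-admins"]},
]
allowed = post_filter(results, ["everyone", "engineering"])
print([r["url"] for r in allowed])  # ['/public/faq']
```

The latency problem is visible here: the engine must over-fetch results so that enough survive the filter to fill a page.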
21. • At query time, Google uses the query-time configuration to make a HEAD
request that allows the logged-in user (within a specific domain) to view
only the content that he is authorized to view.
• Disadvantage:
With this late-binding security model, performance degradation is
inevitable at higher QPS and/or higher result counts.
• Alternative: there are tools that support an early-binding security model,
which allows the search engine to cache the user's security groups along with the
content.
Topic 3: Content – Forms Authentication
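The late-binding check might be sketched like this (the URL and cookie are hypothetical; only the request is built here, not sent):

```python
import urllib.request

def authz_check_request(url, session_cookie):
    # Late binding: build a HEAD request carrying the user's session
    # cookie; at serve time a 200 keeps the result, a 401/403 drops it.
    req = urllib.request.Request(url, method="HEAD")
    req.add_header("Cookie", session_cookie)
    return req

req = authz_check_request("https://intranet.example.com/doc/42", "session=abc")
print(req.get_method())  # HEAD
```

One such round trip per result explains why throughput degrades as QPS and result counts grow.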
22. • One disadvantage of Apache Solr is that it does not handle secured
content. The only way to serve secured content is to store the security
tags/groups as metadata and implement a field- (or metadata-)
constrained search.
• That is where ACLs come into the picture.
Note
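A field-constrained secured search of the kind described above might be built like this (the `acl` field name is an assumption):

```python
from urllib.parse import urlencode

def secured_query(user_query, groups):
    # Field-constrained search: each document stores its allowed groups
    # in an "acl" field, and a filter query (fq) restricts results to
    # the querying user's groups.
    fq = "acl:(" + " OR ".join(groups) + ")"
    return "/select?" + urlencode({"q": user_query, "fq": fq})

url = secured_query("quarterly report", ["everyone", "finance"])
print(url)
```

Because the constraint is applied inside the index rather than after retrieval, this is effectively an early-binding model.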
23. • The GSA provides an open-source component called "search-as-you-type," which
allows end implementers to fetch real-time results from the appliance.
• Disadvantage:
OneBox modules are designed to respond within one second. This can
result in no results from the TermFederator if there is any delay at the
database.
• Alternative: the TermsComponent in Apache Solr is an effective autosuggest tool.
Terms stored in any local text file can be made available to Solr at startup, and a
separate component can merge the two sources alphabetically.
Topic 3: Content – Auto Suggest
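A TermsComponent autosuggest request could be assembled as follows (the `suggest` field name is an assumption):

```python
from urllib.parse import urlencode

def suggest_url(prefix):
    # TermsComponent request: return up to 10 indexed terms from the
    # "suggest" field that start with the typed prefix.
    params = {"terms.fl": "suggest", "terms.prefix": prefix.lower(),
              "terms.limit": 10, "wt": "json"}
    return "/terms?" + urlencode(params)

url = suggest_url("ent")
print(url)  # /terms?terms.fl=suggest&terms.prefix=ent&terms.limit=10&wt=json
```

The front end fires this request on every keystroke and renders the returned terms as suggestions.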
24. • Best bets – aka KeyMatches, aka AdWords.
• Related search terms – the same idea as synonyms.
• Faceted search, aka guided navigation: the GSA does not support faceted search
out of the box, but this feature can be achieved via metadata-constrained search at
query time, similar to how it is implemented in Solr.
• Disadvantage: facet counts in the GSA are not available OOTB.
• Alternative: faceted search is one of Apache Solr's strongest features and is
implemented on many e-commerce websites. (Oracle) Endeca and (HP) Autonomy
also maintain content hierarchies for guided navigation.
Topic 3: Content – User Interface
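A Solr facet request of the kind described above might look like this (the facet field names are assumptions for illustration):

```python
from urllib.parse import urlencode

def facet_url(query, facet_fields):
    # Faceted search request: ask Solr for per-value counts on each
    # facet field alongside the normal result list.
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", f) for f in facet_fields]
    return "/select?" + urlencode(params)

url = facet_url("solr", ["category", "author"])
print(url)  # /select?q=solr&facet=true&facet.field=category&facet.field=author
```

The response carries a count per field value, which the UI renders as guided-navigation links.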
25. • The InfoValuator component captures end-user ratings and saves a
combination of user identity, content URI, and rating value in a backend
data store.
Topic 3: Content – InfoValuator
26. • There is no one search engine that fulfills all enterprise search
requirements. HP Autonomy claims this lofty perch, but it comes with a
huge cost overhead, with the base cost crossing half a million dollars.
• Google is not the right fit for many of the requirements we have seen so
far. Custom search application development is inevitable, and if well
planned, we can use basically any tool on the market to implement
enterprise search as a full-fledged application.
Summary of Session